INDEX
Explanations
references to hate crimes and criminal activities
Negative Logits
icio      -0.81
ッド      -0.81
龍喚士    -0.74
bits      -0.73
ernand    -0.71
comings   -0.71
dit       -0.70
BUS       -0.68
indisp    -0.68
adh       -0.67
Positive Logits
perpetrated    0.97
retaliation    0.90
prosecutions   0.89
spree          0.89
hotline        0.82
targeting      0.82
Victim         0.76
prevention     0.76
incidents      0.75
accusation     0.74
Activations Density: 0.027%