INDEX
Explanations
words related to violence and hate crimes
references to violence and hate crimes
Negative Logits
shire       -0.88
sonian      -0.79
gio         -0.74
Parables    -0.73
phrine      -0.70
hower       -0.70
Oops        -0.69
Guinness    -0.68
ROM         -0.68
Lunar       -0.67
Positive Logits
intimidation    0.98
violence        0.98
retaliation     0.94
indiscrim       0.93
perpetrated     0.93
harassment      0.93
harass          0.92
retribution     0.89
persecution     0.89
slurs           0.85
Activations Density 0.357%
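
As a rough illustration of how a density figure like the one above can be read, the sketch below computes the percentage of token positions on which a feature's activation exceeds a threshold. The function name `activation_density`, the zero threshold, and the toy activation values are assumptions made for this example; the page does not say how its 0.357% figure was computed.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of activations strictly above `threshold`.

    Hypothetical helper: the name and the threshold convention are
    assumptions, not taken from the dashboard itself.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Toy data (made up): the feature fires on 3 of 8 token positions.
toy = [0.0, 0.0, 1.2, 0.0, 0.4, 0.0, 0.0, 2.5]
print(f"{activation_density(toy):.3f}%")  # prints 37.500%
```

Under this reading, a density of 0.357% would mean the feature is active on roughly 1 in 280 sampled tokens.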