INDEX
Explanations
words related to hatred and hate crimes
references to hate crimes and hate speech
Negative Logits
ufact     -0.80
æ©Ł       -0.80
aver      -0.76
idges     -0.73
Decre     -0.73
clinton   -0.71
é¾įå      -0.71
ioned     -0.71
Tablet    -0.71
UNCH      -0.70
Positive Logits
fulness     1.13
fully       1.11
crimes      1.01
vengeance   0.91
ful         0.89
hate        0.84
bre         0.82
prejudice   0.82
hate        0.80
crime       0.79
Activations Density: 0.023%