INDEX
Explanations
phrases related to hate speech and hate crimes
references to hate and hate speech
Negative Logits
Token        Logit
ufact        -0.82
æ©Ł          -0.82
UNCH         -0.80
å§«          -0.79
aver         -0.76
idges        -0.73
Decre        -0.71
é¾įå         -0.71
Tablet       -0.70
ODE          -0.70
Positive Logits
Token        Logit
fulness      1.15
fully        1.12
crimes       0.96
ful          0.91
vengeance    0.86
hate         0.86
hate         0.79
retaliation  0.78
ãĥĨ          0.78
bre          0.78
Activations Density: 0.017%