INDEX
Explanations
phrases related to hate speech and hate crimes
terminology related to hate and hate crimes
Negative Logits
aver          -0.80
UNCH          -0.73
Decre         -0.72
idges         -0.72
姫            -0.71
Examination   -0.68
Pione         -0.68
interstitial  -0.68
ufact         -0.67
ODE           -0.67
Positive Logits
fully         1.10
fulness       1.09
crimes        0.96
ful           0.88
vengeance     0.85
prejudice     0.82
speech        0.79
hound         0.78
crime         0.77
hate          0.76
Activations Density 0.030%