INDEX
Explanations
references to incidents of racism and hate crimes
Negative Logits
Token       Logit
rebel       -0.17
rebels      -0.17
Rebellion   -0.16
Rebels      -0.15
rebell      -0.15
rebellion   -0.15
sey         -0.14
ronic       -0.14
ako         -0.14
irez        -0.14
Positive Logits
Token       Logit
hate        0.44
Hate        0.39
hatred      0.32
_hat        0.30
hates       0.30
hateful     0.29
hat         0.29
Hat         0.28
Hat         0.28
hat         0.28
Activations Density: 0.185%