INDEX
Explanations
references to hate crimes and violence against marginalized communities
Negative Logits
abad    -0.19
Ross    -0.14
スコ    -0.14
moc    -0.14
ادم    -0.13
Anc    -0.13
-Smith    -0.13
ROSS    -0.13
Cannon    -0.13
eneg    -0.13
Positive Logits
toi    0.17
ä»    0.16
æ®    0.15
Journalism    0.15
hate    0.15
Acts    0.15
incel    0.14
акти    0.14
tainment    0.14
finity    0.14
Activations Density: 0.061%