INDEX
Explanations
references to violence and harmful ideologies, particularly relating to genocide and oppression
Negative Logits
less      -0.06
ãĥĬ       -0.06
746       -0.05
yth       -0.05
628       -0.05
sparing   -0.05
a         -0.05
shop      -0.05
-less     -0.05
aint      -0.05
Positive Logits
avou      0.09
tuk       0.09
apesh     0.08
hoot      0.08
ersive    0.08
eryl      0.08
знаÑĩа    0.08
-pills    0.07
à¸Ļà¸Ħ    0.07
bard      0.07
Activations Density: 0.072%