Explanations
references to hate and hateful behavior
Negative Logits
jam        -0.17
ooks       -0.17
nten       -0.15
çĽ         -0.15
umo        -0.15
ãĥ«ãĥķ     -0.14
á»         -0.14
theless    -0.14
onsense    -0.14
azzi       -0.14
Positive Logits
pol        0.17
aad        0.15
ouser      0.15
IH         0.14
inger      0.14
beck       0.14
onian      0.14
emp        0.14
-          0.13
å¸ĿåĽ½     0.13
Activations Density 0.006%