Explanations
words related to distress or suffering
Negative Logits
Token       Logit
ãĥĨãĥ«      -0.16
Hlav        -0.16
ãģŁãģĹ      -0.16
ampire      -0.15
dux         -0.15
elter       -0.14
_locals     -0.14
allel       -0.14
631         -0.14
895         -0.14
Positive Logits
Token                   Logit
Shen                    0.16
iban                    0.15
bian                    0.14
Barcl                   0.14
LM                      0.14
Isl                     0.14
heimer                  0.14
bes                     0.13
_lm                     0.13
--------------------    0.13
Activations Density: 0.039%