Explanations
instances of emotional manipulation or coercion
Negative Logits
Token       Logit
Yue         -0.15
antar       -0.15
iti         -0.14
licht       -0.14
åº          -0.14
pha         -0.14
_kses       -0.14
ãĥĥãĥĦ      -0.13
xFFF        -0.13
itu         -0.13
Positive Logits
Token       Logit
loff        0.15
lẫn         0.15
ype         0.14
ỡ           0.14
Eh          0.14
nowled      0.14
åĿ¡         0.14
ysz         0.13
rance       0.13
strup       0.12
Activations Density: 0.010%