INDEX
Explanations
phrases indicating personal identity or self-description
Negative Logits
ences    -0.16
ison     -0.16
enci     -0.16
ako      -0.15
enia     -0.14
ence     -0.14
enson    -0.14
HQ       -0.14
Dav      -0.14
hq       -0.14
Positive Logits
lix                    0.15
ÑĦÑĸк                  0.15
_________________↵↵    0.15
커ìĬ¤                  0.14
èĢ                     0.14
CCR                    0.14
.cls                   0.14
inish                  0.14
(PyObject              0.14
auc                    0.14
Activations Density: 0.005%