INDEX
Explanations
instances of self-referential or personal statements
Negative Logits
strup    -0.16
lili     -0.14
Roller   -0.14
aries    -0.14
uali     -0.13
402      -0.13
elial    -0.13
force    -0.13
Biz      -0.13
386      -0.13
Positive Logits
etch     0.16
くだ     0.15
ataka    0.15
主任     0.14
eya      0.14
ala      0.14
ニア     0.14
ument    0.14
odied    0.13
ude      0.13
Activations Density: 0.069%