INDEX
Explanations
phrases indicating personal reflection or introspection
Negative Logits
oden    -0.16
agra    -0.15
_gs     -0.15
dbuf    -0.15
aphore  -0.15
ائر     -0.14
alık    -0.14
.fd     -0.14
isty    -0.14
گاÙĩ    -0.14
Positive Logits
Fem     0.17
Lad     0.17
lick    0.15
l       0.15
rough   0.15
sel     0.14
uring   0.14
zes     0.14
ZIP     0.14
ilm     0.14
Activations Density 0.000%