INDEX
Explanations
expressions of personal identity and self-reference
Negative Logits
rvé        -0.15
odzi       -0.15
:↵↵↵↵↵↵    -0.13
imson      -0.13
ÙĦت        -0.13
firm       -0.13
окÑģи      -0.12
.future    -0.12
DBC        -0.12
ipmap      -0.12
Positive Logits
dun        0.26
Dun        0.23
demand     0.20
mean       0.19
fucking    0.18
iiii       0.18
swear      0.17
kr         0.17
SA         0.17
retract    0.17
Activations Density: 0.243%