Explanations
self-referential statements and expressions of doubt or insecurity about one's identity
Negative Logits
idor     -0.17
anke     -0.16
loth     -0.15
MOTE     -0.15
achat    -0.15
anker    -0.15
że       -0.14
Leer     -0.14
LEASE    -0.14
flop     -0.14
Positive Logits
supposed  0.19
missing   0.19
alone     0.19
alone     0.18
headed    0.17
dense     0.17
Welcome   0.17
hereby    0.17
welcome   0.16
welcome   0.16
Activations Density: 0.100%