Explanations
pronouns related to self-reference and identity
Negative Logits
Token         Logit
atch          -0.14
ibus          -0.14
mond          -0.14
زÙĬØ©         -0.14
illion        -0.14
illos         -0.14
.FontStyle    -0.14
lines         -0.14
ond           -0.14
leans         -0.13
Positive Logits
Token         Logit
-même         0.24
zelf          0.21
zÅij          0.16
ipsis         0.16
ĵ             0.15
376           0.15
ORT           0.14
362           0.14
enek          0.14
762           0.14
Activations Density 0.049%