INDEX
Explanations
names ending in 'i'
the pronoun "I" and, relatedly, references to self or identity
Negative Logits
pter                -0.77
ãĤµãĥ¼ãĥĨãĤ£ãĥ¯ãĥ³  -0.71
lisher              -0.70
cffff               -0.69
eatures             -0.68
imentary            -0.67
*/(                 -0.65
ilater              -0.65
sburg               -0.65
lain                -0.64
Positive Logits
Äĩ      1.13
orno    1.09
plom    1.00
ples    1.00
ère     0.99
ye      0.98
ota     0.96
oti     0.96
pling   0.93
uli     0.92
Activations Density 0.056%