INDEX
Explanations
proper nouns, particularly names and titles
Negative Logits
lej      -0.17
pite     -0.16
ÄĮer     -0.15
VERR     -0.15
arken    -0.14
rish     -0.14
argon    -0.14
ully     -0.14
ncmp     -0.14
Ìģc      -0.13
Positive Logits
.,       0.15
S        0.14
ffd      0.14
Jones    0.14
å®       0.14
orf      0.14
/*/      0.14
cz       0.13
Maz      0.13
Sahara   0.13
Activations Density 0.228%