INDEX
Explanations
proper nouns, particularly names of individuals and places
Negative Logits
Token      Logit
steder     -0.16
asad       -0.16
duto       -0.15
reesome    -0.15
aget       -0.15
keley      -0.15
elow       -0.14
ariate     -0.14
ecies      -0.14
udio       -0.14
Positive Logits
Token      Logit
ern        0.35
arn        0.34
ERN        0.34
orn        0.33
urn        0.31
ern        0.30
horn       0.29
ORN        0.29
URN        0.29
ورن        0.28
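Dashboards built on byte-level BPE vocabularies often show non-ASCII tokens in their raw vocabulary form, for example `ÙĪØ±ÙĨ` for the last token in the list above. The following is a minimal sketch of how such a string can be mapped back to readable text. It assumes the tokenizer uses the GPT-2 style byte-to-unicode table, which this page does not confirm; `bytes_to_unicode` and `decode_token` are illustrative helper names, not part of any dashboard API.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style byte -> printable character table (assumed here)."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to code point 256 + n, in byte order.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


def decode_token(raw):
    """Turn a raw vocabulary string back into the UTF-8 text it encodes."""
    unicode_to_byte = {ch: b for b, ch in bytes_to_unicode().items()}
    data = bytes(unicode_to_byte[ch] for ch in raw)
    # Tokens can end mid-character, so undecodable bytes are replaced, not raised.
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_token("ÙĪØ±ÙĨ"))
```

Under that assumption, the raw string decodes to the three Arabic letters waw, ra, nun, which fits the "-rn" pattern of the other positive-logit tokens.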
Activations Density 0.186%
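Activation density is commonly read as the fraction of sampled tokens on which the feature fires at all. The sketch below illustrates that reading; the definition, the `threshold` parameter, and the function name `activation_density` are assumptions made for illustration, not something this page specifies.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


if __name__ == "__main__":
    # Hypothetical sample: 186 active tokens out of 100,000.
    sample = [1.0] * 186 + [0.0] * 99_814
    print(f"{activation_density(sample):.3%}")
```

With the hypothetical sample shown, 186 active tokens out of 100,000 corresponds to a density of 0.186%.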