INDEX
Explanations
proper nouns, especially names and titles
Negative Logits
rette     -0.17
yre       -0.16
iet       -0.16
crest     -0.15
yr        -0.15
uros      -0.15
runner    -0.15
aram      -0.15
RS        -0.15
verte     -0.15
Positive Logits
ksen      0.20
anged     0.19
ivery     0.18
ivative   0.17
angement  0.17
neği      0.17
fts       0.16
shire     0.16
uelle     0.15
iminal    0.15
Activations Density 0.018%