INDEX
Explanations
proper names of individuals
proper nouns, specifically names and titles
NEGATIVE LOGITS
Token     Logit
rics      -0.75
otide     -0.75
acted     -0.68
ounty     -0.67
arty      -0.67
lance     -0.66
rums      -0.65
river     -0.63
ding      -0.63
iddler    -0.63
POSITIVE LOGITS
Token     Logit
chal      1.09
sel       0.98
chwitz    0.97
urance    0.96
pect      0.95
ques      0.94
sell      0.94
ocial     0.94
sembly    0.91
que       0.91
Activations Density: 0.102%