INDEX
Explanations
mentions of specific names and entities
capitalized proper nouns, particularly names and titles of entities
Negative Logits
showc      -0.70
arching    -0.69
cort       -0.63
forth      -0.62
contrace   -0.62
psychiat   -0.61
horm       -0.59
forth      -0.58
Sylv       -0.57
embargo    -0.56
Positive Logits
zees       0.83
ufact      0.79
kas        0.71
oola       0.71
culosis    0.70
emouth     0.69
åŃIJ        0.68
Beasts     0.67
rities     0.66
gat        0.66
Activations Density: 0.303%