INDEX
Explanations
mentions of specific proper nouns or brands
the word "One" and various pronouns and articles that suggest direct address or reference
Negative Logits
slowing        -0.68
assetsadobe    -0.66
nose           -0.66
submar         -0.64
dece           -0.64
noses          -0.64
vain           -0.63
silenced       -0.63
length         -0.62
diver          -0.62
Positive Logits
oran           0.89
lan            0.85
atson          0.85
zac            0.82
lins           0.79
alla           0.78
iba            0.78
chel           0.78
opolis         0.76
ussie          0.75
Activations Density 0.184%