INDEX
Explanations
definite articles followed by capitalized words
the word "the" and its variations as part of various phrases
Negative Logits
without      -0.73
perse        -0.73
/"           -0.70
patiently    -0.69
alone        -0.68
iod          -0.67
—-           -0.67
--+          -0.66
eno          -0.65
according    -0.65
Positive Logits
oret         1.61
resa         1.39
odore        1.31
orem         1.25
ories        1.24
atre         1.20
easiest      1.06
biggest      1.04
hardest      1.03
simplest     1.01
Activations Density 0.295%
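
The page does not say how the density figure is calculated. A common reading, assumed here, is the fraction of token positions on which the feature has a nonzero activation. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and not taken from the source.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions where the feature
    fires at all. The dashboard's exact definition is not stated.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 2000 token positions, 6 of which activate the feature.
sample = [0.0] * 1994 + [0.4, 1.2, 0.7, 2.1, 0.3, 0.9]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.295% would mean the feature fires on roughly 3 of every 1,000 tokens, which fits a feature tied to one narrow context such as "the" before particular words.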