INDEX
Explanations
phrases related to things being off or outside of the norm
the determiner "the" in various contexts throughout the text
Negative Logits
appreci       -0.68
cumbers       -0.59
diplom        -0.59
expel         -0.58
noticeably    -0.57
tremend       -0.57
Wond          -0.55
abundantly    -0.54
perhaps       -0.54
tonight       -0.53
Positive Logits
atre          1.29
ory           1.22
mes           1.14
oret          1.05
ater          1.01
orem          0.94
aters         0.94
ATER          0.93
-             0.93
ories         0.90
Activations Density 0.047%