Explanations
proper nouns or names, specifically those that start with a capital letter
empty passages or sections within the text
Negative Logits
theirs       -0.67
beforehand   -0.65
.</          -0.65
themselves   -0.61
qi           -0.59
â̦â̦         -0.59
âĢij         -0.58
with         -0.58
Ãĥ           -0.58
����         -0.58
Positive Logits
resa         1.45
odore        1.44
oret         1.36
ories        1.18
simplest     1.03
nce          1.00
atre         0.99
downside     0.97
latest       0.94
easiest      0.92
Activation density: 0.498%