INDEX
Explanations
the word "at" in various contexts within the text
Negative Logits
none      -0.18
every     -0.17
NONE      -0.16
uction    -0.16
eld       -0.16
lew       -0.16
EVERY     -0.15
laus      -0.15
each      -0.15
both      -0.15
Positive Logits
tall          0.20
ally          0.18
Raphael       0.17
Tall          0.16
ll            0.16
altogether    0.15
rawl          0.15
ally          0.15
skins         0.15
al            0.14
Activations Density 0.012%