INDEX
Explanations
instances of the words "tell" and "told."
Negative Logits
езд          -0.16
ial          -0.15
olar         -0.15
ic           -0.14
esters       -0.14
bole         -0.14
zu           -0.14
utt          -0.14
Overrides    -0.14
estr         -0.13
Positive Logits
tales        0.24
stories      0.22
ingly        0.21
tale         0.21
us           0.20
fortunes     0.19
told         0.19
lies         0.18
me           0.18
tell         0.17
Activations Density: 0.045%