INDEX
Explanations
the word "the" and variations of it
Negative Logits
uent         -0.17
uild         -0.15
astically    -0.15
æ¯Ľ          -0.14
neau         -0.14
eree         -0.14
eness        -0.14
suming       -0.14
oms          -0.14
enef         -0.13
Positive Logits
late         0.49
late         0.44
Late         0.36
Late         0.33
man          0.26
son          0.26
likes        0.24
estim        0.23
odore        0.22
incom        0.21
Activations Density 0.250%