Explanations
mentions of the word "who."
Negative Logits
scan     -0.16
mente    -0.15
andon    -0.15
stick    -0.15
abbo     -0.15
icens    -0.15
type     -0.15
raf      -0.14
tti      -0.14
tt       -0.14
Positive Logits
oping    0.30
oped     0.23
ever     0.20
ops      0.17
upon     0.17
soever   0.17
/if      0.16
onto     0.16
osh      0.15
’d       0.15
Activations Density: 0.129%