INDEX
Explanations
occurrences of the word "who" in various contexts
Negative Logits
robat     -0.18
rogram    -0.17
ented     -0.16
Darling   -0.16
uers      -0.15
umer      -0.15
бе        -0.14
_stuff    -0.14
urge      -0.14
ipher     -0.14
Positive Logits
else      0.26
wouldn    0.24
ops       0.24
ever      0.22
ever      0.21
osh       0.18
else      0.18
opi       0.18
-ever     0.17
op        0.17
Activations Density 0.020%