INDEX
Explanations
instances of the word "who."
Negative Logits
robat     -0.20
ted       -0.17
Darling   -0.17
bian      -0.16
mented    -0.15
bol       -0.15
cline     -0.15
aises     -0.15
nt        -0.15
net       -0.14
Positive Logits
ops       0.30
ever      0.30
else      0.28
opi       0.26
oping     0.26
osh       0.25
am        0.23
op        0.22
opsy      0.22
ope       0.21
Activations Density: 0.025%