Explanations
instances of the word "who."
Negative Logits
mente     -0.16
idan      -0.16
utor      -0.16
ned       -0.16
lix       -0.15
idious    -0.15
uries     -0.15
ly        -0.15
rad       -0.15
nya       -0.14
Positive Logits
oping     0.34
upon      0.28
oped      0.23
soever    0.23
've       0.23
'd        0.21
’ve       0.21
ever      0.20
’d        0.20
despite   0.20
Activations Density: 0.141%