INDEX
Explanations
references to the word "who" in various contexts
Negative Logits
robat      -0.20
Darling    -0.17
_PD        -0.15
çį         -0.15
ted        -0.15
mented     -0.15
aster      -0.14
bian       -0.14
ented      -0.14
nt         -0.14
Positive Logits
else       0.30
ops        0.28
ever       0.27
opi        0.23
am         0.22
ELSE       0.22
osh        0.21
Else       0.21
oping      0.20
needs      0.20
Activations Density: 0.024%