INDEX
Explanations
instances of the word "who" in various contexts
Negative Logits
Token     Logit
ting      -0.17
ented     -0.17
ning      -0.16
ty        -0.16
ural      -0.15
cola      -0.15
smarty    -0.15
nox       -0.15
ng        -0.15
colo      -0.15
Positive Logits
Token     Logit
else      0.25
ever      0.17
Else      0.17
opup      0.16
oping     0.16
ãĥªãĥ³    0.16
osh       0.15
ELSE      0.15
else      0.15
ensch     0.15
Activations Density: 0.040%