INDEX
Explanations
the word "who" and its variations, indicating a focus on identity or inquiry about individuals
Negative Logits
mented    -0.19
ning      -0.18
cola      -0.18
onte      -0.18
roi       -0.17
ces       -0.16
殿        -0.16
sona      -0.16
illos     -0.16
uros      -0.16
Positive Logits
osh       0.24
else      0.24
oping     0.23
opi       0.21
ops       0.21
Else      0.19
opsy      0.18
ever      0.18
opers     0.17
soever    0.17
Activations Density 0.030%