INDEX
Explanations
the word "who" in various contexts
Negative Logits
Token     Logit
robat     -0.18
mented    -0.17
cede      -0.16
cola      -0.16
roi       -0.16
алеж      -0.16
illos     -0.16
ning      -0.15
bian      -0.15
ÑĥÑĢÑĥ    -0.15
Positive Logits
Token     Logit
else      0.31
ops       0.25
oping     0.25
osh       0.24
ever      0.22
opi       0.21
Else      0.21
ELSE      0.20
opsy      0.20
_else     0.19
Activations Density: 0.032%