Explanations
the word "who" to identify subjects or individuals in various contexts
Negative Logits
rogram    -0.16
ningen    -0.16
abd       -0.15
алеж      -0.15
woo       -0.14
ault      -0.14
uet       -0.14
ted       -0.14
rov       -0.13
spi       -0.13
Positive Logits
else      0.35
_else     0.23
ELSE      0.22
soever    0.21
exactly   0.21
/how      0.20
Else      0.20
else      0.19
opi       0.18
else      0.18
Activations Density: 0.029%