INDEX
Explanations
references to the word "who" in various contexts, indicating a focus on questions about identity or roles within a narrative
Negative Logits
раль     -0.16
woo      -0.16
ault     -0.15
pute     -0.15
spi      -0.15
rana     -0.15
rogram   -0.14
181      -0.13
erais    -0.13
atik     -0.13
Positive Logits
else     0.35
_else    0.24
ELSE     0.21
soever   0.21
/how     0.20
exactly  0.20
Else     0.19
else     0.18
else     0.18
osh      0.18
Activations Density: 0.029%