INDEX
Explanations
questions and statements that inquire about the reasons behind actions or beliefs
Negative Logits
WHETHER   -0.15
dazu      -0.14
ino       -0.14
ynamo     -0.14
.YesNo    -0.14
scan      -0.14
chet      -0.14
oret      -0.14
tons      -0.14
Ïĥκε      -0.13
Positive Logits
/how      0.43
soever    0.33
they      0.28
we        0.28
exactly   0.27
it        0.26
there     0.24
bother    0.23
certain   0.23
/if       0.23
Activations Density 0.051%