Explanations
interrogative phrases and questions relating to reasoning and analysis
Follows "how," "why," or "where"
how why where questions
Negative Logits
he        -0.64
they      -0.62
it        -0.52
we        -0.47
hiran     -0.45
juvant    -0.43
the       -0.43
the       -0.43
он        -0.42
they      -0.42
Positive Logits
does      1.26
do        1.21
did       1.20
Does      0.99
Did       0.96
are       0.88
Does      0.88
Did       0.86
is        0.79
can       0.77
Activations Density: 0.161%