INDEX
Explanations
questions starting with "How" and their variants
Negative Logits
Token        Logit
whether      -0.14
lier         -0.14
WA           -0.14
ovable       -0.14
Whether      -0.14
ÏĢι          -0.13
aneously     -0.13
elier        -0.13
406          -0.13
proh         -0.13
Positive Logits
Token        Logit
did          0.35
does         0.29
do           0.29
did          0.27
Did          0.22
Did          0.22
long         0.21
.did         0.21
)did         0.21
old          0.20
Activations Density 0.036%