INDEX
Explanations
questions starting with "Are" and capital letters
interrogative phrases or questions
Negative Logits
Token          Logit
odder          -0.94
FTWARE         -0.76
rouse          -0.75
matter         -0.69
analysis       -0.68
uncture        -0.67
ication        -0.67
isine          -0.66
Dragonbound    -0.65
ruption        -0.64
Positive Logits
Token          Logit
wolves         0.95
these          0.90
you            0.85
those          0.82
THESE          0.82
pas            0.81
YOU            0.77
they           0.77
we             0.77
there          0.75
Activations Density 0.067%