INDEX
Explanations
questions ending with question marks
questions that prompt introspection or inquiry about various topics
Negative Logits
athe         -0.71
ridor        -0.67
encount      -0.66
itialized    -0.66
onds         -0.65
ankles       -0.64
glim         -0.63
aper         -0.63
aled         -0.62
eper         -0.62
Positive Logits
Surely       0.95
Wouldn       0.93
Nope         0.90
Why          0.90
.?           0.89
Certainly    0.88
Perhaps      0.86
What         0.85
Probably     0.85
Would        0.84
Activations Density 0.090%
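
As a rough illustration of what a density figure like the one above could mean, the sketch below assumes that activation density is the fraction of sampled tokens on which the feature's activation is greater than zero. That definition, the function name activation_density, the threshold parameter, and the sample data are all assumptions made for illustration; none of them come from this page.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Hypothetical helper: assumes "density" means the share of sampled
    tokens on which the feature fires at all.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic example data (not taken from the dashboard):
# 10,000 tokens, of which 9 have a non-zero activation.
sample = [0.0] * 9991 + [1.2, 0.4, 3.1, 0.7, 2.2, 0.9, 1.5, 0.3, 2.8]

density = activation_density(sample)
print(f"{density:.3%}")  # 9 / 10000 of the tokens are active
```

Under this assumed definition, a density of 0.090% would correspond to the feature firing on roughly 9 out of every 10,000 sampled tokens.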