Explanations
questions
question and answer formats in the text
Negative Logits
outweigh  -0.70
atten     -0.68
outwe     -0.68
guards    -0.68
ords      -0.66
coales    -0.66
comple    -0.64
oval      -0.63
bloom     -0.61
umblr     -0.60
Positive Logits
Why      0.97
WHY      0.94
What     0.92
Explain  0.92
Why      0.87
Hi       0.83
Hello    0.83
Hello    0.80
How      0.79
Suppose  0.79
Activations Density: 0.054%