INDEX
Explanations
quotation marks and dialogue in the text
Negative Logits
Token     Logit
,         -0.25
"         -0.17
[         -0.17
.         -0.16
(space)   -0.16
:         -0.15
"s        -0.15
,**       -0.15
"",       -0.15
\n        -0.15
Positive Logits
Token     Logit
and       0.40
And       0.39
And       0.39
but       0.32
But       0.26
"↵        0.25
But       0.25
but       0.25
Also      0.25
and       0.25
Activations Density: 0.044%