Explanations
parentheses and other grouping symbols in the text
Negative Logits
Token   Logit
$\      -0.87
_       -0.75
${      -0.73
(-      -0.70
[       -0.70
{       -0.68
@       -0.68
$-      -0.67
$-.     -0.67
$       -0.66
Positive Logits
Token   Logit
..)     1.12
,)      0.96
....)   0.93
…).     0.91
);      0.88
...),   0.86
.),     0.85
)       0.84
?),     0.84
),      0.83
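Feature dashboards of this kind typically derive the negative and positive logit lists by projecting the feature's decoder direction through the model's unembedding matrix and keeping the k most negative and most positive entries. The sketch below assumes that procedure; `decoder_dir`, `W_U`, `vocab`, and `top_logit_effects` are hypothetical names, and the toy inputs are illustrative, not values from this page.

```python
import numpy as np

def top_logit_effects(decoder_dir, W_U, vocab, k=10):
    """Return the k most negative and k most positive logit effects.

    Assumed inputs (hypothetical):
      decoder_dir: shape (d_model,), the feature's decoder direction
      W_U:         shape (d_model, d_vocab), the unembedding matrix
      vocab:       list of d_vocab token strings
    """
    effects = decoder_dir @ W_U      # logit effect per vocab token, shape (d_vocab,)
    order = np.argsort(effects)      # token indices sorted by ascending effect
    negative = [(vocab[i], float(effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(effects[i])) for i in order[::-1][:k]]
    return negative, positive

# Toy example: 2-dimensional residual stream, 4-token vocabulary.
decoder_dir = np.array([1.0, 0.0])
W_U = np.array([[0.5, -0.2, 1.1, -0.9],
                [0.0,  0.0, 0.0,  0.0]])
vocab = ["(", "[", ")", "$\\"]
neg, pos = top_logit_effects(decoder_dir, W_U, vocab, k=2)
print(neg)  # most suppressed tokens first
print(pos)  # most promoted tokens first
```

Sorting the full effect vector once and slicing both ends keeps the two lists consistent with each other; for large vocabularies, `np.argpartition` would avoid a full sort.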
Activation Density: 0.512%
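Activation density is the share of token positions on which the feature is active; 0.512% corresponds to roughly 1 token in 195 (1 / 0.00512 ≈ 195.3). A minimal sketch, assuming "active" means an activation strictly above a threshold of zero; `activation_density` and its `threshold` parameter are hypothetical, not part of any dashboard API.

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold` (assumed 0)."""
    acts = np.asarray(activations, dtype=float)
    return float(np.mean(acts > threshold))

# Toy example: the feature fires on 1 of 4 tokens.
density = activation_density([0.0, 0.0, 1.3, 0.0])
print(f"{density:.3%}")
```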