INDEX
Explanations
code initialization and structure
Negative Logits
[     1.10
$[\   0.94
[\    0.93
[$    0.91
([    0.90
$[    0.88
[\    0.87
[`    0.87
[     0.87
[<    0.84
Positive Logits
{-    0.94
{-    0.86
={    0.82
={    0.79
={"   0.77
{     0.76
,{    0.75
{&    0.72
{(    0.71
{"    0.70
Activations Density 0.006%