INDEX
Explanations
references to specific scientific terms or variables in mathematical contexts
Negative Logits
↵↵                 -1.12
(token missing)    -1.09
↵                  -1.08
,                  -1.04
1                  -0.96
2                  -0.94
.                  -0.93
(                  -0.89
/                  -0.88
(token missing)    -0.88
Positive Logits
<unused43>    1.80
<unused41>    1.80
<pad>         1.79
<unused3>     1.79
<unused74>    1.79
<unused79>    1.79
<unused42>    1.79
<unused28>    1.79
<unused8>     1.79
<unused14>    1.79
Activations Density 0.018%