INDEX
Explanations
references to the term "the" and its variations in phrases
Negative Logits
<unused42>    -1.27
<unused41>    -1.27
<pad>         -1.26
[@BOS@]       -1.26
<unused68>    -1.26
<unused74>    -1.26
<unused43>    -1.26
<unused23>    -1.26
<unused3>     -1.26
<unused14>    -1.26
Positive Logits
,             0.42
↵             0.40
1             0.38
for           0.36
I             0.35
and           0.34
I             0.33
↵↵            0.32
:             0.30
The           0.30
Activations Density 0.300%