INDEX
Explanations
references to individuals or groups mentioned in the text
who followed by a verb
Negative Logits
Token       Logit
.           -0.35
,           -0.32
and         -0.32
<eos>       -0.32
:           -0.31
I           -0.31
!           -0.31
several     -0.30
none        -0.30
inter       -0.29
Positive Logits
Token        Logit
<unused28>   0.83
<unused23>   0.83
<unused68>   0.83
<unused74>   0.83
<unused41>   0.82
<unused16>   0.82
<unused79>   0.82
<unused80>   0.82
[@BOS@]      0.82
<unused8>    0.82
Activations Density: 0.038%