INDEX
Explanations
references to groups of people and personal pronouns
personal perspective
Negative Logits
(not rendered)   -0.47
with             -0.44
and              -0.40
,                -0.39
to               -0.39
in               -0.38
of               -0.38
↵                -0.37
a                -0.36
-                -0.35
Positive Logits
[@BOS@]          1.41
<unused14>       1.40
<unused41>       1.40
<unused79>       1.40
<unused28>       1.40
<unused8>        1.40
<unused43>       1.40
<unused52>       1.40
<unused3>        1.39
<unused16>       1.39
Activations Density 0.159%