Explanations
references to collective pronouns related to groups or individuals
Negative Logits
itself      -0.22
was         -0.16
ocate       -0.15
nad         -0.14
nut         -0.14
st          -0.14
atti        -0.13
isnt        -0.13
nbsp        -0.13
page        -0.13
Positive Logits
themselves  0.37
’re         0.36
are         0.34
're         0.34
've         0.29
’ve         0.26
were        0.25
'll         0.25
’ll         0.23
/she        0.23
Activations Density 0.213%