INDEX
Explanations
instances of the pronoun "he" and its variations
Negative Logits
Token          Logit
were           -0.20
from           -0.19
itself         -0.18
on             -0.17
during         -0.15
ly             -0.15
isContained    -0.15
nbsp           -0.15
ness           -0.15
across         -0.15
Positive Logits
Token          Logit
'd             0.64
'll            0.61
’d             0.53
/she           0.50
’ll            0.50
've            0.44
're            0.41
eding          0.40
's             0.39
himself        0.36
Activations Density 0.340%
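
The page does not say how the density figure is computed. A common reading is the fraction of sampled tokens on which the feature fires at all, and the sketch below assumes that definition. The function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and chosen only for illustration; they are not taken from the dashboard's data.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature is active, which may differ from the dashboard's own definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 tokens, 34 of which activate the feature.
sample = [0.0] * 9966 + [1.5] * 34

print(f"Activations Density {activation_density(sample):.3%}")
```

Under this reading, a density well below 1% means the feature is silent on nearly all tokens, as one would expect for a feature tied to a specific pronoun.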