INDEX
Explanations
pronouns referring to a male person
repeated references to a specific individual
Negative Logits
earch      -0.78
Peak       -0.68
higher     -0.66
awar       -0.64
rame       -0.64
veyard     -0.63
peak       -0.62
tones      -0.62
aura       -0.62
reshold    -0.62
Positive Logits
'll        1.22
'd         1.20
zbollah    1.06
wrote      1.02
tweeted    1.02
've        0.90
resy       0.89
joked      0.89
wondered   0.88
penned     0.88
Activations Density: 0.259%