INDEX
Explanations
references to a specific pronoun for individuals, particularly focusing on their actions and statements
Negative Logits
noon        -0.75
acters      -0.69
rocket      -0.69
iencies     -0.65
NAT         -0.63
disabling   -0.62
menstrual   -0.62
Measure     -0.61
berra       -0.59
atible      -0.59
Positive Logits
said        1.18
replied     1.16
wrote       1.15
exclaimed   1.14
joked       1.09
laughed     1.04
remarked    1.03
says        1.03
tweeted     1.02
said        1.02
Activations Density 0.044%