INDEX
Explanations
pronouns followed by verbs
pronouns, particularly the words "he" and "she."
Negative Logits
noon          -0.85
rocket        -0.69
anking        -0.64
iries         -0.64
uits          -0.61
earch         -0.60
intervening   -0.60
NAT           -0.59
reach         -0.58
Role          -0.58
Positive Logits
said          1.00
wrote         0.97
'd            0.95
joked         0.93
said          0.89
tweeted       0.89
says          0.87
laughed       0.86
aeus          0.85
exclaimed     0.85
Activations Density 0.054%