INDEX
Explanations
mentions of mothers
references to the word "mom."
Negative Logits
Token        Logit
Flavoring    -0.87
lihood       -0.83
REDACTED     -0.70
vernment     -0.69
ENGTH        -0.62
Morales      -0.61
Voting       -0.61
etting       -0.60
Crimes       -0.59
CONT         -0.59
Positive Logits
Token        Logit
my           1.25
hesis        1.10
mers         1.04
ma           1.04
mer          0.92
mas          0.88
mom          0.85
wife         0.84
entary       0.84
heses        0.84
Activations Density 0.014%