Explanations
mentions of family members, particularly mothers
references to the concept of "mom"
Negative Logits
lihood       -0.80
vernment     -0.74
Flavoring    -0.72
ENGTH        -0.66
veyard       -0.65
chnology     -0.65
IGHTS        -0.64
RAFT         -0.64
Tribunal     -0.62
isson        -0.61
Positive Logits
hesis        1.18
my           1.07
ma           1.00
mom          1.00
wife         0.94
Mom          0.88
heses        0.87
dad          0.83
tor          0.82
Mom          0.82
Activations Density 0.013%