INDEX
Explanations
words related to specific groups of people, such as "People", "Men", and "Women"
references to different groups of people, specifically highlighting gender and identity
Negative Logits
dads          -0.69
enegger       -0.67
husbands      -0.64
principals    -0.64
whipping      -0.63
fathers       -0.62
backbone      -0.62
parents       -0.61
stump         -0.60
pedoph        -0.60
Positive Logits
ysc           0.82
MpServer      0.81
ettlement     0.77
Eater         0.75
rights        0.75
Ago           0.75
ascript       0.73
Soft          0.72
Killed        0.72
Own           0.72
Activations Density 0.094%
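
As a rough illustration of how a density figure like the one above can be read, the sketch below assumes that "activations density" means the fraction of token positions on which the feature's activation is above zero. That definition, the function name `activation_density`, and the sample data are assumptions made for illustration and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumed definition: a position counts as "active" when its activation
    is strictly greater than the threshold (zero by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 100,000 token positions, 94 of which are active.
sample = [1.5] * 94 + [0.0] * (100_000 - 94)
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.094%
```

Under this reading, a density of 0.094% would mean the feature fires on roughly one token position in a thousand.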