INDEX
Explanations
phrases related to societal norms and behavior
negative assertions about gender dynamics and societal roles
Negative Logits
Token         Logit
iHUD          -0.76
Chain         -0.69
hent          -0.66
predecessor   -0.65
éĹ            -0.64
someone       -0.64
client        -0.62
ainment       -0.62
plate         -0.61
holder        -0.61
Positive Logits
Token                Logit
outnumbered          1.07
disproportionately   1.05
joice                0.96
rejoice              0.90
overwhelmingly       0.89
flock                0.87
dominate             0.86
smarter              0.84
behave               0.84
enslaved             0.84
Activations Density 0.621%
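
The page does not define the density figure. A common reading is the share of sampled token positions on which the feature fires at all. The sketch below assumes that reading. The function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature is
    active (activation > threshold). The dashboard itself does not say so.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1,000 token positions, 6 of which activate the feature.
sample = [0.0] * 994 + [1.3, 0.2, 4.1, 0.7, 2.5, 0.9]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.600%
```

Under this reading, a density of 0.621% would mean the feature fires on roughly 6 of every 1,000 sampled tokens.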