Explanations
references to female characters or pronouns
Negative Logits
Monfieur    -0.91
raiſ        -0.87
Efq         -0.87
Houſe       -0.85
againſt     -0.83
cauſe       -0.81
uſe         -0.79
itſelf      -0.77
purpoſe     -0.76
נטרנט       -0.76
Positive Logits
her         1.79
his         1.51
Her         1.43
HER         1.33
her         1.29
Her         1.28
she         1.26
His         1.17
HIS         1.15
she         1.09
Activations Density: 0.148%