INDEX
Explanations
references to gender and pronouns
pronouns and gendered terms
Negative Logits
Piece       -0.49
Pile        -0.49
enters      -0.47
yip         -0.47
Vital       -0.46
Pure        -0.45
osto        -0.45
aneously    -0.44
apas        -0.44
癡          -0.44
Positive Logits
pronouns        0.62
pronoun         0.54
ьаж             0.48
&___            0.47
Comprometido    0.44
orgánico        0.41
integridad      0.41
Étymologie      0.41
humo            0.40
astéro          0.40
Activations Density: 0.004%