Explanations
cultural references and terms related to social roles and status
Negative Logits
kvinnor     -0.70
kvinder     -0.70
vrouwen     -0.69
ženy        -0.66
Women       -0.66
women       -0.64
Girls       -0.63
féminine    -0.63
women       -0.62
женщин      -0.62
Positive Logits
spin        0.53
dow         0.48
courtes     0.46
spin        0.44
maiden      0.44
seam        0.44
Dow         0.42
ancest      0.40
Spin        0.40
روس         0.39
Activations Density: 0.452%