INDEX
Explanations
references to social hierarchies and relationships
Negative Logits
kvinnor         -0.67
kvinder         -0.67
vrouwen         -0.61
девочки         -0.61
girls           -0.59
ženy            -0.59
PerformLayout   -0.58
Girls           -0.57
žena            -0.57
mujeres         -0.57
Positive Logits
spin            0.56
spin            0.48
courtes         0.46
virgin          0.45
dow             0.45
maiden          0.45
Spin            0.43
Amazon          0.43
ny              0.42
vir             0.41
Activations Density: 0.473%