INDEX
Explanations
concepts related to gender roles and identities
Negative Logits
その他    -0.16
ourd      -0.15
íky       -0.14
(other    -0.14
roat      -0.14
altri     -0.14
#End      -0.14
ylko      -0.13
agger     -0.13
ロン      -0.13
Positive Logits
numerator  0.31
left       0.25
either     0.23
east       0.22
Left       0.22
either     0.21
male       0.21
north      0.20
Either     0.20
offense    0.20
Activations Density 0.686%
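
The logit lists and the density figure above can be illustrated with a small sketch. It assumes one common dashboard convention, which this page may or may not follow: a feature's logit effect on each token is the dot product of the feature's output direction with that token's unembedding column, and activation density is the percentage of token positions where the feature's activation is above zero. The function names, the toy matrix, and the toy vocabulary are all hypothetical and chosen only for illustration.

```python
def top_logit_effects(feature_dir, unembed, vocab, k=2):
    """Rank tokens by the logit change a feature direction would induce.

    feature_dir: list of floats, length d (hypothetical feature output direction)
    unembed:     d rows, each with one entry per vocabulary token (toy unembedding)
    vocab:       list of token strings, one per column of unembed
    Returns (most_negative, most_positive), each a list of (token, effect).
    """
    effects = [
        sum(d * row[j] for d, row in zip(feature_dir, unembed))
        for j in range(len(vocab))
    ]
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    most_negative = ranked[:k]          # lowest effects first
    most_positive = ranked[::-1][:k]    # highest effects first
    return most_negative, most_positive


def activation_density(activations):
    """Percentage of token positions where the feature fires (activation > 0)."""
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > 0)
    return 100.0 * fired / len(activations)


# Toy data (assumed, not taken from the page above).
vocab = ["left", "other", "east", "altri"]
feature_dir = [1.0, -0.5]
unembed = [
    [0.3, -0.1, 0.2, 0.0],
    [0.0, 0.1, -0.1, 0.2],
]

negative, positive = top_logit_effects(feature_dir, unembed, vocab)
density = activation_density([0.0, 0.0, 1.2, 0.0])
```

With this toy data, `negative` lists "other" then "altri", `positive` lists "left" then "east", and `density` is 25.0. A real dashboard would compute the same quantities over the full vocabulary and a large token sample, which is how a figure as small as 0.686% can arise.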