INDEX
Explanations
references to social roles and group identities
Negative Logits
Halk      -0.15
olik      -0.15
round     -0.15
_UNUSED   -0.14
UTH       -0.14
heels     -0.14
VICES     -0.13
xico      -0.13
both      -0.13
Wide      -0.13
Positive Logits
lint      0.17
ynos      0.16
azen      0.16
engin     0.15
.cod      0.14
åıį       0.14
bÃŃ       0.14
ños       0.14
ارد       0.14
_FONT     0.13
Activations Density 0.079%