INDEX
Explanations
terms related to gender, including gender equality, gender identity, and gender roles in society
Negative Logits
Token         Logit
vip           -0.16
.metamodel    -0.15
sher          -0.15
_RING         -0.15
rish          -0.14
yling         -0.14
lashes        -0.14
ãĤ¯ãĤ»        -0.14
yan           -0.14
signature     -0.14
Positive Logits
Token         Logit
ed            0.37
roles         0.28
que           0.27
-neutral      0.26
neutral       0.26
fluid         0.25
edn           0.24
neutral       0.23
-fluid        0.23
Roles         0.22
Activations Density 0.013%
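
The density figure is not defined on this page. The following is a minimal sketch, assuming it means the fraction of sampled tokens on which the feature fires (activation above zero). The function name, the threshold, and the toy data are hypothetical and only illustrate how a value such as 0.013% could arise.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    is active. This definition is not confirmed by the dashboard itself.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 13 of 100,000 sampled tokens.
acts = [0.0] * 99_987 + [1.5] * 13
density = activation_density(acts)
print(f"{density:.3%}")  # prints "0.013%"
```

Under this reading, a density of 0.013% would mean the feature is active on roughly 13 of every 100,000 tokens, which is consistent with a sparse, topic-specific feature.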