INDEX
Explanations
references to female identities and gender dynamics
Negative Logits
nya     -0.15
ese     -0.15
ary     -0.15
meld    -0.15
lu      -0.15
ning    -0.15
nel     -0.14
ally    -0.14
sel     -0.14
rig     -0.14
Positive Logits
itarian     0.19
volent      0.18
洲          0.18
別          0.18
factor      0.17
erre        0.16
.Flag       0.15
Outlined    0.14
别          0.14
hoàng       0.14
Activations Density 0.015%