Explanations
references to gender identity, sexual orientation, and related education topics
Negative Logits
Token         Logit
eyn           -0.16
Sud           -0.14
sud           -0.14
_splits       -0.14
.masks        -0.14
upos          -0.14
speculative   -0.14
Sibling       -0.14
igg           -0.14
inals         -0.14
Positive Logits
Token         Logit
sex           0.68
Sex           0.54
sex           0.50
SEX           0.48
-sex          0.48
Sex           0.47
_sex          0.44
.sex          0.43
sexual        0.38
SEX           0.38
Activations Density: 0.085%