INDEX
Explanations
references to sexual orientation and gender identity
Negative Logits
eyn           -0.17
sud           -0.16
Sud           -0.15
Sloan         -0.15
inals         -0.15
Schwe         -0.15
slave         -0.14
ril           -0.14
speculative   -0.14
ameleon       -0.14
Positive Logits
sex           0.63
Sex           0.48
sex           0.45
-sex          0.45
SEX           0.44
Sex           0.40
.sex          0.39
_sex          0.39
性            0.37
性            0.35
Activations Density: 0.064%