Explanations
references to gender identity and expressions related to faith and societal norms
Negative Logits
znik         -0.14
acaÄŁ        -0.13
ichten       -0.13
reck         -0.13
LIABLE       -0.13
quit         -0.12
innitus      -0.12
íĴĪ          -0.12
domest       -0.12
åĿ           -0.12
Positive Logits
gender        0.42
Gender        0.38
transgender   0.38
genders       0.34
Gender        0.34
gender        0.34
genitals      0.31
sex           0.31
genital       0.30
Assigned      0.30
Activations Density 0.070%