INDEX
Explanations
References to LGBTQ+ identities, specifically terms related to "gay."
Negative Logits
inus     -0.17
cker     -0.17
帯       -0.16
rál      -0.16
sg       -0.16
ahy      -0.15
sk       -0.15
iams     -0.15
стин     -0.15
ейств    -0.15
Positive Logits
lord         0.30
dar          0.26
atri         0.25
-rights      0.24
bor          0.22
lords        0.21
rights       0.21
-friendly    0.20
ety          0.20
est          0.20
Activations Density 0.010%