INDEX
Explanations
references to identity and pronouns in the context of gender
Negative Logits
èĥĨ        -0.07
ãĥ¡ãĥ©     -0.06
หล         -0.06
_GAP       -0.06
ška        -0.06
atoi       -0.06
uren       -0.06
arse       -0.06
lien       -0.06
-ignore    -0.06
Positive Logits
dana              0.08
precision         0.08
avoid             0.07
sensitivity       0.07
avoid             0.07
usage             0.07
sensitive         0.07
respectful        0.07
sensit            0.07
.scalablytyped    0.07
Activations Density 0.007%