INDEX
Explanations
words related to possessive pronouns
instances of gender-neutral references
Negative Logits
Integ           -0.66
prelim          -0.60
utsche          -0.60
AIR             -0.60
vib             -0.57
Interstitial    -0.56
names           -0.54
rap             -0.54
clearly         -0.54
Names           -0.53
Positive Logits
nam             0.93
acles           0.85
Else            0.81
henko           0.81
chid            0.80
ifice           0.77
ilings          0.76
acle            0.74
ーテ            0.74
acular          0.73
Activations Density 0.055%