INDEX
Explanations
pronouns referring to male individuals
NEGATIVE LOGITS
iland       -0.15
edii        -0.14
병          -0.14
((__        -0.14
uego        -0.14
ăm          -0.13
-opacity    -0.13
rupted      -0.13
oggle       -0.13
Země        -0.13
POSITIVE LOGITS
or          0.19
/her        0.15
ados        0.15
rello       0.14
idi         0.14
zan         0.14
andise      0.14
golf        0.14
Bek         0.14
olen        0.14
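The page does not say how the logit lists above were produced. Dashboards of this kind commonly project a feature's decoder direction through the model's unembedding matrix and report the most positive and most negative tokens. The following is a minimal sketch under that assumption only. `logit_effects`, `top_and_bottom`, and the toy vocabulary and matrices are hypothetical, and a real pipeline may also account for the final layer norm.

```python
from typing import List, Sequence, Tuple


def logit_effects(decoder_dir: Sequence[float],
                  W_U: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical helper: project a feature's decoder direction (length d_model)
    through an unembedding matrix W_U (d_model rows x d_vocab columns)."""
    d_vocab = len(W_U[0])
    return [sum(decoder_dir[i] * W_U[i][j] for i in range(len(decoder_dir)))
            for j in range(d_vocab)]


def top_and_bottom(vocab: Sequence[str], effects: Sequence[float], k: int
                   ) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most positive and the k most negative (token, effect) pairs."""
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    bottom = ranked[:k]                # most negative first
    top = list(reversed(ranked[-k:]))  # most positive first
    return top, bottom


# Toy example with made-up values (d_model = 2, d_vocab = 4).
vocab = ["tok0", "tok1", "tok2", "tok3"]
decoder_dir = [1.0, -0.5]
W_U = [[0.2, 0.1, -0.1, 0.0],
       [0.02, -0.1, 0.1, -0.2]]

effects = logit_effects(decoder_dir, W_U)
top, bottom = top_and_bottom(vocab, effects, k=2)
print([(t, round(e, 2)) for t, e in top])     # [('tok0', 0.19), ('tok1', 0.15)]
print([(t, round(e, 2)) for t, e in bottom])  # [('tok2', -0.15), ('tok3', 0.1)]
```

Sorting once and slicing both ends keeps the sketch short. For a full vocabulary of tens of thousands of tokens, a partial selection such as `heapq.nlargest` would avoid sorting everything.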
Activation density: 0.152%
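Activation density is usually read as the fraction of token positions on which the feature fires. Under that reading, 0.152% means the feature is active on roughly 1.5 of every 1,000 tokens. The page does not state the exact threshold, so the sketch below assumes "active" means an activation strictly greater than zero. The function name and the toy data are hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of token positions whose activation exceeds `threshold`
    (assumed definition; the dashboard's own threshold is not stated)."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data: the feature fires on 3 of 2,000 token positions.
acts = [0.0] * 2000
for pos in (10, 500, 1999):
    acts[pos] = 1.7

density = activation_density(acts)
print(f"{density:.3%}")  # 0.150%
```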