Explanations
references to men or male-related terms
Negative Logits
erc         -0.15
FactoryBot  -0.15
gart        -0.15
Ī           -0.15
(::         -0.15
ighton      -0.15
er          -0.15
oslav       -0.14
(es         -0.14
лиц         -0.14
Positive Logits
opause      0.28
cken        0.25
endez       0.24
iscal       0.23
ager        0.23
aced        0.22
acing       0.22
едж         0.21
aces        0.21
ubar        0.21
Activations Density 0.021%