Explanations
references to men and masculinity
Negative Logits
Token        Logit
غان          -0.15
ede          -0.15
xy           -0.15
verty        -0.15
ture         -0.14
asan         -0.14
усе          -0.14
edian        -0.14
pon          -0.13
universal    -0.13
Positive Logits
Token        Logit
aces         0.17
volent       0.16
opause       0.16
iscal        0.16
chor         0.16
ardu         0.15
acing        0.15
inery        0.14
ylim         0.14
âl           0.14
Activations Density 0.069%