INDEX
Explanations
references to gender, particularly related to men
Negative Logits
Token    Logit
er       -0.20
tica     -0.19
eriod    -0.18
engin    -0.18
erif     -0.17
erin     -0.16
dete     -0.16
erde     -0.16
ted      -0.16
agini    -0.15
Positive Logits
Token    Logit
folk     0.49
opause   0.46
aces     0.38
ial      0.38
ager     0.35
aced     0.34
ials     0.33
/w       0.33
's       0.31
fol      0.30
Activations Density: 0.034%