INDEX
Explanations
references to gender and youth
Negative Logits
Token       Logit
hood        -0.17
igy         -0.16
indle       -0.15
thon        -0.15
524         -0.15
etros       -0.15
domicile    -0.15
gesch       -0.14
oleon       -0.14
embed       -0.14
Positive Logits
Token       Logit
itter       0.22
burg        0.18
'           0.18
ingers      0.17
ITTER       0.16
Madden      0.16
itters      0.15
cout        0.15
-only       0.15
gint        0.15
Activations Density: 0.045%