INDEX
Explanations
references to women
Negative Logits
Token       Logit
ypes        -0.86
Flavoring   -0.83
rador       -0.78
UFF         -0.76
ython       -0.75
ysical      -0.74
agascar     -0.74
umbn        -0.73
rss         -0.73
DIS         -0.71
Positive Logits
Token       Logit
hood         1.11
izer         1.01
folk         0.94
pher         0.85
cule         0.83
woman        0.82
who          0.81
izers        0.77
uscript      0.76
comed        0.75
Activations Density 0.041%