Explanations
references to or descriptions of women
mentions of women in various contexts
Negative Logits
Token       Logit
ypes        -0.92
Flavoring   -0.89
agascar     -0.79
ython       -0.79
raltar      -0.78
UFF         -0.76
vernment    -0.76
inctions    -0.73
rador       -0.72
ernels      -0.72
Positive Logits
Token       Logit
izer        1.09
hood        1.05
folk        0.95
pher        0.94
cule        0.87
Louise      0.82
woman       0.81
izers       0.80
who         0.80
herself     0.79
Activations Density: 0.048%