INDEX
Explanations
sexist, misogynistic, or blaming women
Negative Logits
રહ્યો  0.98
ગયો  0.88
אתה  0.87
நண்ப  0.85
метр  0.81
раствора  0.75
涣  0.73
તો  0.73
CONFIG  0.71
jego  0.70
Positive Logits
women  4.22
female  4.21
feminist  4.07
feminine  3.92
feminism  3.85
여성  3.83
Women  3.82
Women  3.79
feminists  3.76
femininity  3.75
Activations Density: 0.709%