Explanations
promoting prejudice and discrimination
Negative Logits
femminile    0.55
feminine     0.54
女性         0.50
жі           0.49
female       0.48
Female       0.47
feminina     0.47
femin        0.47
fémin        0.46
ktop         0.46
Positive Logits
prejudice        1.62
hatred           1.52
hate             1.41
bigotry          1.41
prejudices       1.40
prejudiced       1.38
discrimination   1.35
Prejudice        1.35
racism           1.25
hateful          1.25
Activations Density 0.055%