INDEX
Explanations
overcoming discrimination and prejudice
reinforces harmful stereotypes
Negative Logits
अनुर          0.50
необы         0.49
enthusiast    0.49
runny         0.46
moelle        0.44
உற்ச          0.43
veloce        0.42
Reliability   0.42
সন্ন          0.42
பக்தர்கள்     0.42
Positive Logits
discriminatory   1.84
sexism           1.80
discrimination   1.73
racism           1.73
racist           1.70
misog            1.69
sexist           1.63
Discrimination   1.55
Racism           1.54
discrimination   1.53
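This page does not say how the logit lists are produced. A common approach, assumed here, is to project the feature's decoder direction through the model's unembedding matrix and rank tokens by the resulting score. The sketch below illustrates that idea on toy data; `logit_effects`, `top_and_bottom`, the vectors, and the vocabulary are all hypothetical and are not taken from this dashboard.

```python
def logit_effects(decoder_direction, unembedding, vocab):
    """Score each token by the dot product of the feature direction
    with that token's unembedding vector (assumed method)."""
    effects = {}
    for token, token_vector in zip(vocab, unembedding):
        effects[token] = sum(d * u for d, u in zip(decoder_direction, token_vector))
    return effects


def top_and_bottom(effects, k):
    """Return the k highest-scoring and k lowest-scoring tokens."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1], reverse=True)
    most_positive = ranked[:k]
    most_negative = ranked[-k:][::-1]  # lowest score first
    return most_positive, most_negative


# Toy, made-up values: a 2-dimensional feature direction and 3 tokens.
toy_direction = [1.0, 0.0]
toy_vocab = ["racism", "runny", "the"]
toy_unembedding = [[2.0, 0.5], [-0.5, 1.0], [0.0, 3.0]]

toy_effects = logit_effects(toy_direction, toy_unembedding, toy_vocab)
toy_positive, toy_negative = top_and_bottom(toy_effects, k=1)
print(toy_positive, toy_negative)
```

Under this assumption, the positive list holds the tokens the feature promotes most strongly and the negative list the tokens it suppresses most strongly.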
Activations Density 0.306%
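As a rough illustration of what a density figure like this could mean, the sketch below assumes it is the fraction of sampled tokens on which the feature activates above zero. That definition, the function name `activation_density`, and the sample data are assumptions for illustration only; the page itself does not define the metric.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumed definition; the dashboard does not state how density is computed.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active_count = sum(1 for value in activations if value > threshold)
    return active_count / len(activations)


# Hypothetical sample: 1000 tokens, the feature fires on 3 of them.
toy_activations = [0.0] * 997 + [0.8, 1.2, 0.4]
toy_density = activation_density(toy_activations)
print(f"{toy_density:.3%}")
```

A low value of this kind indicates a sparse feature, one that fires on only a small share of tokens.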