INDEX
Explanations
phrases related to discrimination and prejudice, particularly focusing on sexism
terms related to sexism and misogyny
Negative Logits
Token      Logit
Trust      -0.76
leaf       -0.73
Package    -0.71
hyde       -0.71
mental     -0.69
ernels     -0.68
uilding    -0.68
VIS        -0.68
ving       -0.67
NAS        -0.67
POSITIVE LOGITS
Token        Logit
sexist       1.03
misogyn      0.89
slurs        0.87
jokes        0.81
stereotypes  0.79
stereotyp    0.78
banter       0.76
Equality     0.74
feminists    0.73
sexism       0.73
Activations Density 0.018%
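
For readers unfamiliar with the density figure, the sketch below shows one common way such a number is computed: the percentage of sampled tokens on which the feature's activation is above zero. This is an assumption about the metric, not a statement of how this dashboard derives it; the function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: density means "share of tokens where the feature fires",
    with firing defined as activation > threshold.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return 100.0 * fired / len(activations)


# Hypothetical toy data: 10,000 tokens, the feature fires on 2 of them.
toy = [0.0] * 10_000
toy[17] = 3.2
toy[4242] = 1.1

density = activation_density(toy)
print(f"{density:.3f}%")  # prints "0.020%"
```

Under that reading, a density of 0.018% would correspond to the feature firing on roughly 18 of every 100,000 tokens, which fits a narrow, topic-specific feature.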