Explanations
misogyny and misogynistic language
Negative Logits
inequality  0.51
Inequality  0.46
polyg       0.45
Sheep       0.40
ropole      0.40
wives       0.40
Focusing    0.39
focussing   0.39
focusing    0.38
лари        0.38
Positive Logits
noir        0.44
oir         0.41
Savo        0.41
நிர்        0.40
вена        0.40
nicotin     0.40
żli         0.40
abilit      0.40
बचाया       0.39
soir        0.39
Activations Density 0.001%