Explanations
negative statements or controversies related to public figures
derogatory terms and phrases related to social issues and public figures
Negative Logits
Token        Logit
prepar       -0.70
Incre        -0.63
igree        -0.62
vantage      -0.61
eworks       -0.59
yrinth       -0.59
rieve        -0.59
accompan     -0.59
ilitation    -0.58
synerg       -0.58
Positive Logits
Token        Logit
sexist       1.26
racist       1.19
homophobic   1.18
misogyny     1.18
misogyn      1.17
racists      1.14
sexism       1.10
feminists    1.10
homophobia   1.10
slurs        1.09
Activations Density: 1.010%
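
An activation density figure like the one above is commonly read as the fraction of sampled token positions on which the feature fires. The following is a minimal sketch of that calculation, assuming density means "share of positions with activation above a threshold". The function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and are not taken from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled token positions on
    which the feature is active. The dashboard's exact definition may differ.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 2 of 200 token positions.
sample = [0.0] * 198 + [0.7, 1.3]
density = activation_density(sample)
print(f"{density:.3%}")  # prints "1.000%"
```

Under this reading, a density near 1% means the feature is sparse: it is inactive on roughly 99 of every 100 sampled positions.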