INDEX
Explanations
negative social interactions related to controversy and offensive behavior
language related to insults and derogatory remarks
Negative Logits
tnc                 -0.75
Instit              -0.73
profits             -0.73
uph                 -0.69
doi                 -0.69
natureconservancy   -0.69
Effect              -0.68
Architects          -0.67
nav                 -0.67
oha                 -0.67
POSITIVE LOGITS
slurs               1.72
insults             1.66
derogatory          1.52
homophobic          1.51
vulgar              1.42
racist              1.37
sexist              1.36
abusive             1.36
sarcastic           1.34
hateful             1.33
Activations Density 0.358%