INDEX
Explanations
derogatory terms and expressions, including personal attacks and insults
derogatory terms and discussions around defamation and negative speech related to individuals or groups
Negative Logits
transitions   -0.76
milestone     -0.72
longitudinal  -0.68
transition    -0.65
ELE           -0.65
wearable      -0.65
outdoor       -0.65
infrared      -0.64
Transition    -0.63
outgoing      -0.61
Positive Logits
ueless        1.06
hypocrisy     1.04
slander       1.03
hypocr        1.02
hypocritical  1.02
udicrous      1.01
entious       1.00
bigotry       0.98
contempt      0.96
insulting     0.96
Activations Density 0.318%