INDEX
Explanations
negative terms or insults
derogatory terms and insults directed at individuals or groups
Negative Logits
tnc       -0.76
ondo      -0.74
RH        -0.74
ira       -0.73
winner    -0.72
Recomm    -0.70
APH       -0.69
ranging   -0.69
ANC       -0.68
iture     -0.68
Positive Logits
idiots    1.03
idiot     1.02
bastard   0.98
bully     0.97
spew      0.95
hypoc     0.91
bitch     0.90
asshole   0.89
sucker    0.88
crap      0.86
Activations Density: 0.072%