Explanations
Instances of abusive or threatening behavior
Offensive or abusive language
Personal attacks
Negative Logits
Token         Logit
发表于        -0.58
parad         -0.57
__':          -0.57
coltà         -0.56
Voluntary     -0.53
[]            -0.52
preved        -0.51
Italijanski   -0.51
panik         -0.51
ύπ            -0.51
Positive Logits
Token         Logit
insults       1.30
harassment    1.20
insulting     1.17
insult        1.13
bullying      1.06
insulted      1.05
harassing     0.96
tau           0.95
threats       0.94
attacks       0.94
Activations Density 0.465%