INDEX
Explanations
violence, abuse, and harmful behavior
Negative Logits
dividend      0.59
dividends     0.49
Y             0.46
dividend      0.45
panning       0.44
Tom           0.42
artifacts     0.42
Dividend      0.42
품질          0.42
arbitrage     0.41
Positive Logits
Violence      0.91
violência     0.89
violence      0.88
Violence      0.88
perpetrators  0.84
perpetrator   0.84
violencia     0.82
bullying      0.81
虐            0.80
bullying      0.79
Activations Density: 0.390%