INDEX
Explanations
instances of verbal abuse or harassment
Negative Logits
arent    -0.85
roxy     -0.83
bard     -0.82
akeru    -0.81
enthal   -0.81
alach    -0.79
avorite  -0.79
xon      -0.78
ktop     -0.75
uden     -0.74
Positive Logits
altercation    1.04
ized           0.98
verbal         0.97
isations       0.96
izing          0.92
izations       0.92
ization        0.91
communication  0.89
abuse          0.89
spar           0.86
Activations Density: 0.007%