Explanations
instances of offensive language and hateful speech
Negative Logits
COUVER             -0.63
TagMode            -0.63
InjectMocks        -0.60
__))               -0.59
findpost           -0.59
Administrativna    -0.56
estimés            -0.56
PropertyChanging   -0.56
errHandler         -0.55
ValueStyle         -0.54
Positive Logits
racist             0.73
racist             0.67
offensive          0.64
degrading          0.64
offensi            0.61
Rac                0.59
Offensive          0.58
hateful            0.58
haine              0.57
racism             0.56
Activations Density 0.084%