INDEX
    Explanations

    instances of offensive language and hateful speech

    Negative Logits
    "COUVER"              -0.63
    "TagMode"             -0.63
    "InjectMocks"         -0.60
    "__))"                -0.59
    "findpost"            -0.59
    " Administrativna"    -0.56
    " estimés"            -0.56
    "PropertyChanging"    -0.56
    "errHandler"          -0.55
    "ValueStyle"          -0.54
    Positive Logits
    " racist"       0.73
    "racist"        0.67
    " offensive"    0.64
    " degrading"    0.64
    " offensi"      0.61
    " Rac"          0.59
    " Offensive"    0.58
    " hateful"      0.58
    " haine"        0.57
    " racism"       0.56
    Act Density 0.084%

    No Known Activations