INDEX
    Explanations

    Harassment and abusive content

    Negative Logits
    "zahl"           -0.08
    " Lassen"        -0.08
    "guardian"       -0.08
    " copa"          -0.08
    "issor"          -0.08
    "-même"          -0.08
    "_elapsed"       -0.08
    "-là"            -0.08
    " miraculous"    -0.08
    "最快"           -0.07
    Positive Logits
    "/conf"          0.08
    " harassment"    0.08
    " victim"        0.08
    " envers"        0.08
    "rechte"         0.07
    " rede"          0.07
    "Rede"           0.07
    " political"     0.07
    " Rede"          0.07
    " screened"      0.07
    Act Density 0.010%

    No Known Activations