    Explanations

    expressions of insult or derogatory remarks

    Negative Logits
    Token       Logit
    hin         -0.17
    лиÑĨ        -0.16
    šet         -0.16
    overy       -0.16
    UIFont      -0.15
    夫          -0.15
    eka         -0.15
    reno        -0.15
    ahat        -0.15
    rze         -0.14

    Positive Logits
    Token       Logit
    ably        0.16
    nit         0.15
    alla        0.15
    berman      0.14
    odate       0.14
     Morrison   0.14
    peria       0.14
    ÎĨ          0.13
    breaker     0.13
    /api        0.13
    Activation Density: 0.003%

    No Known Activations