INDEX
    Explanations

    exploit, abuse or endanger

    Negative Logits
    "و"  1.10
    (token not shown)  1.09
    (token not shown)  0.93
    " În"  0.91
    "ueur"  0.91
    (token not shown)  0.88
    "subjects"  0.83
    "па"  0.82
    " Ин"  0.82
    " it"  0.81
    Positive Logits
    ":"  1.92
    ")"  1.33
    "("  1.23
    "/"  1.17
    ";"  1.16
    "),"  1.07
    "$"  1.05
    " Abuse"  1.00
    "ق"  0.97
    " abusive"  0.97
    Act Density 0.015%

    No Known Activations