    Explanations

    harmful activities and content

    Negative Logits
    os      0.41
    on      0.40
    ز       0.38
    েল      0.37
    с       0.37
    مان     0.36
    ів      0.35
    ной     0.35
    ના      0.35
    з       0.35
    Positive Logits
     ר                0.36
     victimization    0.36
     assemblages      0.35
     attacks          0.33
     raids            0.32
     repris           0.32
     акции            0.31
     advis            0.30
     screenings       0.30
     harassment       0.29
    Activation Density: 0.554%

    No Known Activations