    Explanations

    references to incidents of violence or hate crimes

    Negative Logits
    iceps    -0.14
    illing    -0.14
    compression    -0.13
     ******************************************************************************↵    -0.13
    428    -0.13
    woo    -0.13
    oč    -0.13
    очек    -0.13
     Bout    -0.12
    747    -0.12
    Positive Logits
     vandalism    0.43
     vandal    0.41
     graffiti    0.38
     spray    0.33
     gra    0.31
     damage    0.30
     Spray    0.29
     ван    0.29
     arson    0.28
     Gra    0.28
    Act Density 0.057%

    No Known Activations