INDEX
    Explanations

    references to hate crimes and violence against marginalized communities

    Negative Logits
    abad
    -0.19
     Ross
    -0.14
    スコ
    -0.14
     moc
    -0.14
    ادم
    -0.13
     Anc
    -0.13
    -Smith
    -0.13
    ROSS
    -0.13
     Cannon
    -0.13
    eneg
    -0.13
    Positive Logits
    toi
    0.17
    ä»
    0.16
    æ®
    0.15
     Journalism
    0.15
     hate
    0.15
    Acts
    0.15
     incel
    0.14
     акти
    0.14
    tainment
    0.14
    finity
    0.14
    Activation Density 0.061%

    No Known Activations