INDEX
    Explanations

    phrases related to hate speech and hate crimes

    references to hate and hate speech

    NEGATIVE LOGITS
    Token          Logit
    "ufact"        -0.82
    "æ©Ł"          -0.82
    "UNCH"         -0.80
    "å§«"          -0.79
    "aver"         -0.76
    "idges"        -0.73
    "Decre"        -0.71
    "é¾įå"         -0.71
    " Tablet"      -0.70
    "ODE"          -0.70
    POSITIVE LOGITS
    Token            Logit
    "fulness"        1.15
    "fully"          1.12
    " crimes"        0.96
    "ful"            0.91
    " vengeance"     0.86
    " hate"          0.86
    "hate"           0.79
    " retaliation"   0.78
    "ãĥĨ"            0.78
    "bre"            0.78
    Activation Density: 0.017%
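
The activation density figure is usually read as the share of token positions on which the feature fires. The source does not spell out its definition, so the following is only a minimal sketch under that assumption. The function name `activation_density`, the zero threshold, and the sample data are all hypothetical, chosen so that the result lines up with the 0.017% figure.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature is
    active. The dashboard itself does not define the term.
    """
    if len(activations) == 0:
        raise ValueError("activations must not be empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 100,000 token positions, with the feature firing on 17.
acts = [0.0] * 100_000
for i in range(17):
    acts[i * 1000] = 1.5  # arbitrary positive activation value

density = activation_density(acts)
print(f"{density:.3%}")
```

A density this low would mean the feature is silent on nearly all tokens, which fits a feature tied to a narrow topic.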

    No Known Activations