    Explanations

    incidents involving hate crimes or violence against marginalized groups

    Negative Logits
     …
    -0.71
     […]
    -0.63
     ….
    -0.60
     […
    -0.56
     …↵↵
    -0.50
     […]↵↵
    -0.43
    ….
    -0.43
    …………
    -0.41
    ……
    -0.40
    …..
    -0.39
    Positive Logits
    ...↵
    0.69
    ,...↵
    0.56
    ....↵
    0.49
    ...↵↵
    0.47
    ',...↵
    0.42
     ...↵
    0.40
    ..."↵
    0.40
    ...
    0.39
    ,...↵↵
    0.38
    ...)↵
    0.38
    Act Density 0.453%

    No Known Activations