    Explanations

    expressions and instances related to hate speech and its consequences

    Head Attr Weights
    0:0.01
    1:0.01
    2:0.09
    3:0.05
    4:0.05
    5:0.03
    6:0.33
    7:0.06
    8:0.03
    9:0.03
    10:0.17
    11:0.09
    Negative Logits
    -1.46
    */(
    -1.43
    -1.42
    pole
    -1.31
    NEY
    -1.28
     baseman
    -1.28
    -1.28
    ESA
    -1.26
    itte
    -1.24
    aple
    -1.24
    Positive Logits
     intimidation
    1.47
     terrorism
    1.31
    itives
    1.21
    hate
    1.20
     terror
    1.19
    upload
    1.17
    illance
    1.16
     Cthulhu
    1.16
     vandalism
    1.16
     bullying
    1.15
    Act Density 0.005%

    No Known Activations