    Explanations

    mentions related to violent actions or harm being inflicted

    actions or events associated with causing harm or violence

    Negative Logits
    Token       Logit
    "gers"      -0.88
    "cius"      -0.86
    "ger"       -0.79
    "nings"     -0.79
    "bard"      -0.79
    "cean"      -0.79
    "ese"       -0.78
    "sell"      -0.77
    "gered"     -0.75
    "acea"      -0.75
    Positive Logits
    Token       Logit
    "cipline"   0.71
    "ngth"      0.70
    " Lauder"   0.70
    " Vict"     0.70
    "verages"   0.68
    " Seym"     0.67
    "lehem"     0.67
    "awei"      0.66
    "enance"    0.66
    "INGTON"    0.62
    Activation Density: 0.074%

    No Known Activations