    Explanations

    mentions of "killing" or related actions

    Negative Logits
    Token        Logit
    "iland"      -0.21
    "eck"        -0.17
    "iyel"       -0.15
    "sez"        -0.15
    "evice"      -0.15
    "onto"       -0.14
    "asley"      -0.14
    "aland"      -0.14
    "ulled"      -0.14
    "leh"        -0.14
    Positive Logits
    Token          Logit
    " off"         0.25
    "joy"          0.23
    " outright"    0.23
    " spree"       0.23
    "deer"         0.22
    " innocent"    0.21
    "-off"         0.21
    " indiscrim"   0.21
    "现场"         0.20
    "switch"       0.19
    Activation Density: 0.057%

    No Known Activations