    Explanations

    phrases related to making threats against others

    phrases related to threats and the intent to cause harm

    Negative Logits
    Token           Logit
    " Parables"     -0.75
    "cellent"       -0.73
    "emis"          -0.72
    " learners"     -0.71
    "fortable"      -0.70
    "ortunate"      -0.69
    "eret"          -0.69
    "erning"        -0.69
    "kered"         -0.67
    " admirable"    -0.67
    Positive Logits
    Token           Logit
    " blackmail"    0.98
    " boycott"      0.97
    " wrath"        0.97
    " derail"       0.97
    "quit"          0.94
    " veto"         0.94
    " sue"          0.94
    " arrest"       0.91
    " drown"        0.90
    " ruin"         0.90
    Activation Density: 0.151%

    No Known Activations