INDEX
    Explanations

    references to violence and brutality

    NEGATIVE LOGITS
     Dangerous
    -0.16
    Danger
    -0.15
     dangerous
    -0.15
     Mint
    -0.15
    danger
    -0.14
    anes
    -0.14
     Danger
    -0.14
    chner
    -0.14
     menacing
    -0.14
     Paran
    -0.14
    POSITIVE LOGITS
     dec
    0.25
     dissect
    0.24
     dis
    0.22
     hacked
    0.20
     viv
    0.20
     decomposition
    0.20
     decom
    0.19
     киш
    0.19
     mutil
    0.19
     scal
    0.19
    Act Density 0.223%

    No Known Activations