    Explanations

    references to violence and violent actions

    Negative Logits
    Token         Logit
    chg           -0.16
     èĥ           -0.15
    à¥Īल          -0.15
    serter        -0.15
    ãģ°           -0.15
    icari         -0.15
    Insensitive   -0.15
    ãĥ£           -0.14
    ibble         -0.14
    ÑĩÑĥк         -0.14
    Positive Logits
    Token         Logit
    -force        0.17
    ernet         0.17
    /or           0.15
    ³»            0.14
     force        0.14
    lette         0.14
    adier         0.14
    al            0.14
    force         0.14
    _force        0.14
    Activation Density: 0.018%

    No Known Activations