INDEX
    Explanations

    violence and physical attacks

    Negative Logits
    enuity          -0.07
    uang            -0.06
     Rou            -0.06
    τει             -0.06
     bằng           -0.06
     circulated     -0.06
     sanctioned     -0.06
                    -0.05
    .getOrder       -0.05
    ウェ            -0.05
    Positive Logits
     flexible       0.07
     impress        0.07
     Chess          0.07
    MOVED           0.07
    Purple          0.07
     yyn            0.06
     repository     0.06
    ifest           0.06
    :NS             0.06
    invalidate      0.06
    Activation Density 0.180%

    No Known Activations