    Explanations

    instances of violence or aggressive actions

    Negative Logits
    Token               Logit
    "ught"              -0.08
    "prav"              -0.07
    ".createObject"     -0.07
    "eed"               -0.06
    "ấn"                -0.06
    "fal"               -0.06
    " meis"             -0.06
    " spo"              -0.06
    "ÏĦαι"              -0.06
    "íĻĺ"               -0.06

    Positive Logits
    Token               Logit
    " into"             0.09
    " off"              0.08
    " away"             0.08
    "into"              0.07
    "ano"               0.07
    "anos"              0.07
    "aran"              0.06
    " Geh"              0.06
    "_into"             0.06
    " back"             0.06

    Activation Density: 0.068%

    No Known Activations