INDEX
    Explanations

    occurrences of harmful or violent actions

    Negative Logits
    Token           Logit
    " Stevenson"    -0.15
    " Dank"         -0.15
    "640"           -0.14
    "ledo"          -0.14
    "oppel"         -0.14
    "iba"           -0.14
    "Debe"          -0.14
    " hours"        -0.14
    "æľį"           -0.14
    "odzi"          -0.14
    Positive Logits
    Token           Logit
    "ansom"         0.19
    "sher"          0.18
    "etin"          0.15
    "aiser"         0.15
    "endoza"        0.15
    "-LAST"         0.14
    "nock"          0.14
    "én"            0.14
    "ecome"         0.14
    "bic"           0.13
    Act Density 0.048%

    No Known Activations