INDEX
    Explanations

    incidents involving violence or injury

    Negative Logits
    Token           Logit
    "asc"           -0.16
    "ouch"          -0.16
    " oÄį"          -0.15
    " uncomment"    -0.14
    " Uncomment"    -0.14
    "itti"          -0.14
    "ascar"         -0.14
    "ograd"         -0.14
    "andy"          -0.14
    "ascript"       -0.14

    Positive Logits
    Token           Logit
    "ihu"           0.18
    "uka"           0.15
    "errat"         0.15
    "linkplain"     0.15
    "elib"          0.14
    "ting"          0.14
    "iasi"          0.14
    "ilis"          0.14
    "elu"           0.13
    "emale"         0.13
    Activation Density: 0.413%

    No Known Activations