    Explanations

    explicit violence or crime-related details in a news context

    Negative Logits
    »Ĵ             -0.60
    ascript        -0.58
    uously         -0.57
    displayText    -0.57
    izoph          -0.57
     Tanz          -0.57
    irtual         -0.56
    itures         -0.56
    iasm           -0.55
    ENDED          -0.55

    Positive Logits
    wen             0.71
    ghan            0.66
    ewater          0.66
    ster            0.65
    erville         0.64
    coe             0.61
    wyn             0.61
    sters           0.60
    heit            0.60
    vel             0.58

    Activation Density: 0.100%

    No Known Activations