INDEX
    Explanations

    threatening language or mentions of threats

    mentions of threats to safety or security

    Negative Logits
    igs       -0.72
     Cups     -0.70
    OME       -0.70
    neys      -0.68
    neau      -0.67
     Stores   -0.66
     Gins     -0.65
     å¤       -0.64
    abee      -0.63
    abeth     -0.63
    Positive Logits
     threat        3.75
     threats       2.86
    threat         2.85
     Threat        2.52
     menace        2.35
     danger        1.90
     threatening   1.80
     threaten      1.79
     threatened    1.67
     risk          1.44
    Activation Density: 0.018%

    No Known Activations