INDEX
    Explanations

    phrases that indicate potential risks or threats to health and the environment

    Negative Logits
    "é³´"         -0.16
    "ovsky"       -0.16
    "rames"       -0.15
    "lej"         -0.15
    "hi"          -0.15
    "uir"         -0.14
    "UILT"        -0.14
    "inho"        -0.14
    "nee"         -0.14
    " éł"         -0.14
    Positive Logits
    " threat"     0.25
    "idon"        0.24
    "pose"        0.21
    "threat"      0.18
    " risks"      0.17
    " threats"    0.17
    " danger"     0.17
    " Threat"     0.17
    " questions"  0.17
    " Danger"     0.17
    Activation Density: 0.018%

    No Known Activations