INDEX
    Explanations

    references to personal safety and threats

    Negative Logits
    Token         Logit
    "è¥"          -0.07
    "adar"        -0.07
    "CHAT"        -0.06
    "ÑijÑĢ"        -0.06
    "ìĹ¼"         -0.06
    "asic"        -0.06
    " ascent"     -0.06
    "à¸Ńห"        -0.06
    "okus"        -0.06
    "itech"       -0.06
    Positive Logits
    Token           Logit
    " safety"       0.14
    " Safety"       0.13
    " protection"   0.13
    "Safety"        0.12
    " Protection"   0.12
    " threats"      0.11
    "-threat"       0.11
    "å®īåħ¨"        0.11
    " security"     0.11
    "Protection"    0.11
    Activation Density: 0.054%

    No Known Activations
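
Several tokens in the logit lists (for example `å®īåħ¨`) look garbled because they appear to be shown in a byte-level BPE alphabet rather than as decoded text. The sketch below assumes the vocabulary uses the GPT-2-style byte-to-unicode mapping. That is an assumption, since the dashboard does not name the tokenizer. The helper names `bytes_to_unicode` and `decode_token` are illustrative and not part of any tool referenced here. Characters outside the mapping, such as a leading space that the dashboard has already rendered, are passed through unchanged.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Build the GPT-2-style map from byte values to printable characters."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is assigned a code point starting at 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Inverse map: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Decode a byte-level token string back to readable text (assumed mapping)."""
    raw = bytearray()
    for ch in token:
        if ch in UNICODE_TO_BYTE:
            raw.append(UNICODE_TO_BYTE[ch])
        else:
            # Already-decoded characters (e.g. a plain space) pass through.
            raw.extend(ch.encode("utf-8"))
    # Tokens that end mid-character decode to U+FFFD instead of raising.
    return bytes(raw).decode("utf-8", errors="replace")


print(decode_token("å®īåħ¨"))  # 安全 (Chinese for "safety" / "security")
```

Under this assumption, the positive-logit token `å®īåħ¨` decodes to 安全, which fits the other safety-related tokens in that list. A token such as `è¥` is only part of a multi-byte character, so it has no readable form on its own.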