INDEX
    Explanations

    concepts related to safety and security

    Negative Logits
    Token           Logit
    "iesen"         -0.16
    "enstein"       -0.16
    "ihan"          -0.16
    "referrer"      -0.15
    "igor"          -0.14
    "анти"          -0.14
    "rof"           -0.14
    "203"           -0.14
    "237"           -0.14
    "Transient"     -0.13

    Positive Logits
    Token           Logit
    " safety"       0.65
    " Safety"       0.54
    "Safety"        0.52
    " safe"         0.47
    " safer"        0.45
    "安全"          0.45
    "afety"         0.44
    " protection"   0.42
    "safe"          0.41
    " saf"          0.41

    Activation Density 0.164%

    No Known Activations