INDEX
    Explanations

    concepts related to safety and security

    Negative Logits
    " Safety"     -0.45
    " safety"     -0.45
    " safely"     -0.44
    "Safety"      -0.42
    " safer"      -0.39
    " safest"     -0.36
    " saf"        -0.35
    "safe"        -0.35
    " Safe"       -0.34
    " safe"       -0.34

    Positive Logits
    " sound"       0.22
    " Sound"       0.21
    "Sound"        0.20
    " Secure"      0.18
    "Sec"          0.17
    "sound"        0.17
    " secure"      0.17
    " sec"         0.17
    "erville"      0.17
    " SOUND"       0.17

    Act Density 0.028%

    No Known Activations