    Explanations

    terms related to defense and protection

    Negative Logits
    Token              Logit
    " safety"          -0.63
    " safe"            -0.61
    " Safe"            -0.60
    " emergency"       -0.60
    " Safety"          -0.58
    "SAFE"             -0.57
    "Safe"             -0.56
    " Shar"            -0.56
    "EnableWeb"        -0.56
    " безопасности"    -0.55
    Positive Logits
    Token              Logit
    " défend"          0.85
    " defended"        0.79
    " Defend"          0.78
    " defend"          0.74
    " defends"         0.74
    " défendre"        0.74
    " DEFEND"          0.72
    " defending"       0.64
    "ArrowToggle"      0.63
    "ąb"               0.62
    Activation Density: 0.025%

    No Known Activations