INDEX
    NEGATIVE LOGITS
    Token             Logit
    "_orientation"    -0.07
    " Dou"            -0.07
    " continuum"      -0.07
    " decorated"      -0.07
    "τολ"             -0.07
    "َّ"               -0.07
    " flourish"       -0.06
    " Boulder"        -0.06
    " دختر"           -0.06
    "Chr"             -0.06
    POSITIVE LOGITS
    Token             Logit
    " safety"         0.12
    " safe"           0.11
    " Safe"           0.10
    " safer"          0.09
    " Safety"         0.09
    " 안전"           0.08
    "unsafe"          0.08
    " saf"            0.08
    "afe"             0.08
    "Safe"            0.08
    Activation Density: 0.034%

    No Known Activations