INDEX
    Explanations

    violates my safety guidelines

    Negative Logits
    Token (quoted to show leading spaces)    Value
    "alahkan"                                0.40
    "iseite"                                 0.38
    "(!$"                                    0.38
    " عليكم"                                 0.38
    "unlike"                                 0.37
    "erçe"                                   0.37
    " vigil"                                 0.37
    " correctement"                          0.36
    " juridique"                             0.36
    "regulatory"                             0.36
    Positive Logits
    Token (quoted to show leading spaces)    Value
    " several"                               0.77
    " plusieurs"                             0.68
    " Several"                               0.68
    "several"                                0.65
    " عدة"                                   0.61
    " varios"                                0.59
    " flera"                                 0.57
    "Several"                                0.57
    " varias"                                0.56
    " कई"                                    0.56
    Activation Density: 0.011%

    No Known Activations