INDEX
    Explanations

    prohibited actions and refusal

    Negative Logits
     يمكنك    0.76
     می‌توانید    0.70
     reali    0.70
     doğrud    0.70
     réellement    0.69
     होतात    0.69
     शकतात    0.69
    应当    0.67
     prawd    0.66
    \"]\    0.66
    Positive Logits
     exception    1.07
    はその    1.01
     example    0.91
     achieve    0.91
     exceptions    0.91
     achieving    0.90
     exemplifies    0.90
    例外    0.90
     falling    0.88
     achieves    0.87
    Act Density 0.090%

    No Known Activations