INDEX
    Explanations

    cannot fulfill harmful requests

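    Negative Logits
    (Tokens are quoted; a leading space inside the quotes marks a word-initial token.)
    "只不过"       0.38   (Chinese, "merely; only")
    "ದಗ"           0.37   (Kannada script)
    "不如"         0.35   (Chinese, "not as good as")
    " ഓരോ"         0.35   (Malayalam, "each")
    "ಲಭ"           0.35   (Kannada script)
    "ievable"      0.34
    " Estim"       0.34
    "écution"      0.34
    " enviable"    0.34
    " ناقص"        0.34   (Arabic, "deficient; lacking")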

    Positive Logits
    " avoid"       1.65
    " avoided"     1.64
    " forbids"     1.63
    " avoidance"   1.61
    " Avoid"       1.52
    "禁止"         1.51   (Chinese, "prohibit; forbid")
    " avoids"      1.51
    " prohibits"   1.51
    "Avoid"        1.49
    " evitar"      1.48   (Spanish/Portuguese, "to avoid")
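
    One common way SAE dashboards derive positive and negative logit lists like these is to project the feature's decoder vector through the model's unembedding matrix W_U and report the k largest and k smallest resulting entries. The sketch below assumes that convention; this page does not confirm it. The function names and arguments (`logit_effects`, `top_logit_effects`, `decoder_row`, `unembed`) are hypothetical. The Negative Logits values above are printed without a sign, so this page's exact sign convention is unknown. The sketch returns raw signed values and applies no final layer norm.

```python
from typing import List, Sequence, Tuple


def logit_effects(decoder_row: Sequence[float],
                  unembed: Sequence[Sequence[float]]) -> List[float]:
    """Project one feature's decoder direction (length d_model) through a
    hypothetical unembedding matrix W_U of shape d_model x vocab_size."""
    vocab_size = len(unembed[0])
    return [
        sum(decoder_row[d] * unembed[d][t] for d in range(len(decoder_row)))
        for t in range(vocab_size)
    ]


def top_logit_effects(decoder_row: Sequence[float],
                      unembed: Sequence[Sequence[float]],
                      vocab: Sequence[str],
                      k: int = 10) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return (positive, negative): the k tokens the feature pushes up most,
    largest first, and the k it pushes down most, most negative first."""
    logits = logit_effects(decoder_row, unembed)
    order = sorted(range(len(vocab)), key=lambda t: logits[t])  # ascending
    negative = [(vocab[t], logits[t]) for t in order[:k]]
    positive = [(vocab[t], logits[t]) for t in reversed(order[-k:])]
    return positive, negative


if __name__ == "__main__":
    # Toy example: d_model = 2, three-token vocabulary (made-up numbers).
    toy_unembed = [[1.0, 0.0, -1.0],
                   [0.0, 1.0, 0.0]]
    pos, neg = top_logit_effects([2.0, 0.5], toy_unembed, [" avoid", " the", " ok"], k=1)
    print(pos, neg)
```

    With real weights, `decoder_row` would be one row of the SAE's decoder and `unembed` the model's W_U. Plain lists keep the sketch dependency-free. A real implementation would use a single matrix-vector product instead.
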
    Activation Density 0.085%
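
    Activation density is the fraction of sampled tokens on which the feature is active. A density of 0.085% means the feature fires on roughly 85 of every 100,000 tokens. The minimal sketch below assumes that "active" means an activation strictly greater than zero, which is typical for ReLU-style SAEs. The function name `activation_density` and its `threshold` parameter are hypothetical.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Fraction of token positions whose activation exceeds `threshold`
    (assumed firing criterion: strictly greater than the threshold)."""
    total = 0
    fired = 0
    for a in activations:
        total += 1
        if a > threshold:
            fired += 1
    if total == 0:
        raise ValueError("need at least one activation")
    return fired / total


if __name__ == "__main__":
    # Made-up sample: 85 firing positions out of 100,000 tokens.
    acts = [0.0] * 99_915 + [1.3] * 85
    print(f"{activation_density(acts):.3%}")  # prints 0.085%
```
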

    No Known Activations