INDEX
    Explanations

    instances of denial or refusal related to various topics

    Negative Logits
    "/INFO"        -0.15
    "itz"          -0.14
    "RTL"          -0.14
    "ientos"       -0.14
    "umlu"         -0.14
    " Spectrum"    -0.14
    "nelle"        -0.14
    "SCR"          -0.13
    "leh"          -0.13
    "ie"           -0.13
    Positive Logits
    "uga"          0.18
    "egal"         0.17
    "ecure"        0.17
    "arat"         0.15
    "issance"      0.15
    "igma"         0.15
    " gettext"     0.14
    "ệu"           0.14
    "oux"          0.14
    "arges"        0.14
    Activation density: 0.063%

    No Known Activations