INDEX
    Explanations

    instances of refusal or denial in various contexts

    Negative Logits
    patch               -0.15
    olle                -0.15
     سÙĦاÙħ             -0.14
    .getOwnProperty     -0.14
    Reserved            -0.14
    385                 -0.14
     lesbienne          -0.14
    alli                -0.14
    abus                -0.14
    ãģ¤ãģ¶             -0.14
    Positive Logits
     let                0.20
     allow              0.18
     accepting          0.17
     letting            0.17
     admit              0.17
     accept             0.17
     accepts            0.17
     Accept             0.16
     allowing           0.16
     unless             0.15
    Act Density 0.065%

    No Known Activations