INDEX
    Explanations

    phrases indicating directness or explicitness

    Negative Logits
    Token                 Logit
    "eport"               -0.82
    "aido"                -0.76
    "pload"               -0.73
    "emis"                -0.72
    "isure"               -0.70
    " Pastebin"           -0.69
    "kees"                -0.68
    "=-=-=-=-=-=-=-=-"    -0.68
    "lain"                -0.68
    "nan"                 -0.65

    Positive Logits
    Token                 Logit
    " rejection"           0.77
    " refusal"             0.70
    " contradicted"        0.70
    " obliter"             0.69
    " contradicts"         0.69
    " contradict"          0.68
    " disregard"           0.68
    "ERROR"                0.68
    " lie"                 0.66
    "butt"                 0.65
    Activation Density: 0.091%

    No Known Activations