INDEX
    Explanations

    requests or instructions written with a polite tone

    occurrences of the word "please" in various formats

    Negative Logits
    Token         Logit
    "lings"       -0.76
    "pires"       -0.74
    "é¾"          -0.74
    "arthed"      -0.71
    "arc"         -0.70
    "cler"        -0.69
    "IUM"         -0.65
    "MpServer"    -0.65
    "visor"       -0.64
    "ARC"         -0.62
    Positive Logits
    Token         Logit
    " Ignore"     1.07
    " beware"     1.04
    " forgive"    1.03
    " note"       0.98
    " ignore"     0.95
    " advise"     0.93
    " excuse"     0.92
    " refrain"    0.92
    " enable"     0.89
    " consider"   0.88
    Act Density 0.033%

    No Known Activations