    Explanations

    negations and refusals in text

    Negative Logits
     lele       -0.80
     pommes     -0.74
     Nguy       -0.72
     magazin    -0.70
     rong       -0.69
     vian       -0.67
     pama       -0.66
     pipa       -0.66
     adal       -0.65
     Chinois    -0.65
    Positive Logits
     shenan       0.70
    Fuckin        0.67
    Bullshit      0.66
     necessarily  0.66
     philanth     0.66
    FTFY          0.63
    Cringe        0.62
    Ehh           0.62
     unspeak      0.61
    desertcart    0.61
    Activation Density: 0.182%

    No Known Activations