    Explanations

    phrases that imply deception or manipulation in communication

    Negative Logits
    Token         Logit
    "opsis"       -0.16
    " boz"        -0.16
    "roz"         -0.15
    "ellas"       -0.15
    "uen"         -0.15
    "lem"         -0.14
    "uptools"     -0.14
    "ument"       -0.14
    "ipa"         -0.14
    " Comm"       -0.14
    Positive Logits
    Token         Logit
    "etti"        0.17
    "меж"         0.15
    " Alv"        0.15
    "-Ta"         0.14
    "oders"       0.14
    "aghan"       0.14
    ".getP"       0.14
    "λικά"        0.14
    "ees"         0.13
    "Ỽp"          0.13
    Activation Density: 0.206%

    No Known Activations