INDEX
    Explanations

    words related to deception and dishonesty

    Negative Logits
    Token            Logit
    "asco"           -0.18
    "ustin"          -0.15
    "ções"           -0.15
    " breakout"      -0.15
    " Amph"          -0.15
    " Interactive"   -0.14
    "loo"            -0.14
    " Tar"           -0.14
    "jn"             -0.14
    " Stick"         -0.14

    Positive Logits
    Token            Logit
    "base"           0.18
    "grading"        0.18
    "human"          0.18
    "kul"            0.17
    "adece"          0.16
    "Base"           0.16
    "plr"            0.15
    "BASE"           0.15
    "omon"           0.15
    "prec"           0.15

    Act Density 0.029%

    No Known Activations