INDEX
    Explanations

    words related to deception or dishonesty

    Negative Logits
    interrupted    -0.82
    nas            -0.72
    IO             -0.69
    ENTS           -0.69
    LESS           -0.67
    YL             -0.67
    ians           -0.67
    cake           -0.66
    IAN            -0.65
    upon           -0.65
    Positive Logits
    azy            1.28
    eker           1.08
    igh            1.05
    uth            0.97
    eper           0.93
    asure          0.90
    avement        0.89
    pload          0.87
    aving          0.85
    aping          0.84
    Activation Density: 0.019%

    No Known Activations