INDEX
    Explanations

    words associated with deception or negative consequences

    Negative Logits
    ahlen     -0.15
    aydı      -0.15
    argon     -0.15
    urch      -0.14
    айд       -0.14
    umann     -0.14
    agma      -0.14
    ÎijÎł      -0.14
    ìm        -0.14
    rale      -0.14
    Positive Logits
    ous       0.82
    ously     0.68
    OUS       0.62
    ious      0.56
    uous      0.54
    ouse      0.50
    oust      0.50
    uos       0.48
    ousand    0.48
    IOUS      0.47
    Activation Density: 0.063%

    No Known Activations