INDEX
    Explanations

    terms associated with dishonesty and political narratives

    Negative Logits
    "ungi"       -0.16
    "CHandle"    -0.15
    "ewis"       -0.14
    "ungan"      -0.14
    "luk"        -0.14
    "byt"        -0.14
    "eree"       -0.14
    "(Handle"    -0.14
    "ehler"      -0.14
    "िषय"        -0.14
    Positive Logits
    " lie"       0.68
    " lies"      0.66
    " lying"     0.61
    " Lie"       0.56
    "lie"        0.53
    " Lies"      0.52
    "Lie"        0.49
    " fib"       0.48
    " lied"      0.47
    " liar"      0.47
    Activation Density: 0.236%

    No Known Activations