INDEX
    Explanations

    references to dishonesty and deception

    Negative Logits
    Token           Logit
    " Mu"           -0.63
    "ClientSize"    -0.61
    " A"            -0.56
    "帖最后由"      -0.56
    " “"            -0.54
    "↵↵"            -0.54
    " massimo"      -0.53
    " a"            -0.51
    " Gra"          -0.51
    "Mu"            -0.50

    Positive Logits
    Token           Logit
    " lie"          1.45
    " liar"         1.45
    " Lying"        1.41
    " lies"         1.36
    " lied"         1.34
    " LIE"          1.33
    " lying"        1.31
    " liars"        1.30
    " Liar"         1.26
    "Lies"          1.23

    Activation Density: 0.194%

    No Known Activations