INDEX
    Explanations

    phrases indicating deception or betrayal

    NEGATIVE LOGITS
    Token            Logit
    "iaux"           -0.16
    "okrat"          -0.16
    "assage"         -0.15
    "arges"          -0.15
    "ocos"           -0.15
    "heiro"          -0.14
    "parer"          -0.14
    "êu"             -0.14
    "lexport"        -0.14
    "ContentPane"    -0.14
    POSITIVE LOGITS
    Token            Logit
    " giveaway"      0.34
    " betray"        0.31
    " reveal"        0.30
    " clues"         0.29
    " revealing"     0.29
    " giveaways"     0.29
    " Reve"          0.29
    " clue"          0.28
    "reve"           0.27
    " reveals"       0.27
    Activation Density: 0.129%

    No Known Activations