INDEX
    Explanations

    verbs related to actions or events

    statements about deception or manipulation

    Negative Logits
    "accompanied"       -0.75
    "fter"              -0.68
    "fters"             -0.65
    "critical"          -0.65
    "ggles"             -0.63
    "foreseen"          -0.62
    " Approximately"    -0.62
    "è¦ļéĨĴ"            -0.60
    "uers"              -0.60
    "cone"              -0.60
    Positive Logits
    " themselves"       1.39
    " THEIR"            0.88
    " their"            0.85
    " us"               0.81
    " li"               0.80
    " uniforms"         0.72
    "selves"            0.71
    " fools"            0.70
    " selves"           0.66
    " me"               0.65
    Act Density 0.714%

    No Known Activations