INDEX
    Explanations

    phrases about deception and the ability to evade consequences

    Negative Logits
    Token       Logit
    "vat"       -0.16
    "atti"      -0.14
    "coni"      -0.14
    "oucher"    -0.14
    "errer"     -0.13
    "onne"      -0.13
    " Fam"      -0.13
    "bih"       -0.13
    "arin"      -0.13
    "quo"       -0.13
    Positive Logits
    Token         Logit
    " impunity"   0.32
    " away"       0.26
    " escape"     0.24
    " Away"       0.21
    "away"        0.21
    " getaway"    0.21
    " escaping"   0.21
    "escape"      0.20
    " escapes"    0.20
    " immunity"   0.20
    Activation Density: 0.134%

    No Known Activations