    Negative Logits
    Token        Logit
    _regs        -0.07
     filho       -0.07
    _LD          -0.06
     snad        -0.06
     }↵↵         -0.06
     ','         -0.06
    .same        -0.06
    _bank        -0.06
     wildfires   -0.06
    -framework   -0.06
    Positive Logits
    Token        Logit
     deceived    0.10
     deception   0.09
     deceptive   0.08
     deceive     0.08
     deceit      0.06
     dece        0.06
    [sub         0.06
    ookie        0.06
    psilon       0.06
     reasoning   0.06
    Activation Density: 0.009%

    No Known Activations