    Explanations

    references to moral conflicts and behavioral warnings

    Negative Logits
    "ected"      -0.17
    "oret"       -0.15
    "aster"      -0.15
    "erce"       -0.15
    "atin"       -0.14
    " gr"        -0.14
    "rait"       -0.14
    "etto"       -0.14
    "anto"       -0.14
    "once"       -0.14
    Positive Logits
    "_nl"         0.15
    " Hear"       0.15
    "ORIZONTAL"   0.15
    "abd"         0.15
    "PEND"        0.14
    "ipur"        0.14
    ".assertIs"   0.14
    "orb"         0.14
    "tid"         0.14
    " OV"         0.14
    Activation Density: 0.044%

    No Known Activations