INDEX
    Explanations

    code and documentation

    The neuron fires on instructional or meta-prompt language, especially negation cues such as "not" and related instructional terms that indicate prohibitions.

    Negative Logits
    Token               Logit
    " bais"             -0.07
    "Z"                 -0.07
    "_third"            -0.07
    "z"                 -0.06
    " interpolated"     -0.06
    (blank)             -0.06
    " الك"              -0.06
    (blank)             -0.06
    "Repeat"            -0.06
    ".magic"            -0.06
    Positive Logits
    Token               Logit
    " 없습니다"          0.06
    ".ย"                0.06
    " теб"              0.06
    " TokenType"        0.06
    "vro"               0.06
    " нес"              0.06
    " unleashed"        0.06
    "CKER"              0.06
    " Dund"             0.06
    (blank)             0.06
    Activation Density: 0.016%

    No Known Activations