    Explanations

    This neuron fires on tokens in the model’s own generated (assistant) responses rather than on user or system prompt text.

    NEGATIVE LOGITS
    _birth         -0.07
     Flesh         -0.07
     Pat           -0.06
     maxY          -0.06
    jections       -0.06
     milf          -0.06
     KA            -0.06
     filename      -0.06
     faker         -0.06
    .Perform       -0.06
    POSITIVE LOGITS
    (disposing     0.06
    =              0.06
    _SELECTED      0.06
    .dsl           0.06
    -counter       0.06
                   0.06
     perder        0.06
     nạn           0.06
     hayvan        0.06
     komm          0.06
    Act Density 0.056%

    No Known Activations