    Explanations

    positive personal changes

    This neuron fires on tokens in the assistant’s generated responses (i.e. it marks words produced by the model, not the user).

    Negative Logits
    िथ          -0.07
    flation     -0.07
    ouro        -0.06
     Luther     -0.06
    bud         -0.06
    ifie        -0.06
    主任        -0.06
    мага        -0.06
     Acts       -0.06
     colder     -0.06
    Positive Logits
    =YES        0.07
    ]<<         0.07
                0.07
    _ALREADY    0.06
    _Variable   0.06
    }','        0.06
    .chunk      0.06
     ман        0.06
    >'.         0.06
     último     0.06
    Act Density 0.056%

    No Known Activations