INDEX
    Explanations

    The neuron mainly picks up on positive evaluative adjectives in the assistant’s self-descriptions—especially “helpful” (and related words like “accurate” or “safe”).

    New Auto-Interp
    Negative Logits
     dependencies
    -0.07
     cro
    -0.07
    .service
    -0.07
    @app
    -0.07
     returning
    -0.06
    CX
    -0.06
    ("(%
    -0.06
     leaks
    -0.06
     kèm
    -0.06
     vtx
    -0.06
    POSITIVE LOGITS
     apenas
    0.07
     가지고
    0.06
    _IM
    0.06
     utilized
    0.06
     polygons
    0.06
    زارش
    0.06
    	setTimeout
    0.06
     koneč
    0.06
    Joe
    0.06
     resistant
    0.06
    Act Density 0.020%

    No Known Activations