INDEX
    Explanations

    The neuron fires on self-referential AI identity phrases (e.g. “As a large language model,” “AI,” “model,” etc.).

    New Auto-Interp
    Negative Logits
    computation
    0.41
    čných
    0.41
    bins
    0.40
    খানে
    0.40
    কম
    0.39
    ujte
    0.38
    UL
    0.38
    0.38
    TR
    0.37
     такое
    0.37
    POSITIVE LOGITS
     انجام
    0.49
     performed
    0.48
     ನೀವು
    0.45
     influencia
    0.43
    ্যাগ
    0.43
     influencing
    0.42
     handled
    0.41
     influences
    0.41
     influência
    0.40
     извър
    0.40
    Act Density 0.034%

    No Known Activations