INDEX
    Explanations

    The neuron detects the "model" (assistant) speaker token, i.e., the token that marks the start of a model/assistant response.

    Negative Logits
    Token           Value
    "orum"          0.42
    " інформа"      0.42
    " нашей"        0.42
    "ще"            0.42
    " да"           0.42
    "нных"          0.42
    " paraphr"      0.42
    " пита"         0.41
    " orifice"      0.41
    " горе"         0.41
    Positive Logits
    Token           Value
    "your"          0.46
    " secretly"     0.44
    " hopelessly"   0.43
    " ඔබේ"          0.43
    " fascist"      0.42
    " இரவு"         0.42
    " secret"       0.42
    " blazing"      0.42
    " YOUR"         0.41
    " Your"         0.41
    Act Density 0.057%
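
    For readers unfamiliar with the density figure, the sketch below shows one common way such a number can be computed: the fraction of token positions on which the neuron's activation exceeds a threshold. The function name `activation_density`, the zero threshold, and the sample data are all assumptions made for illustration; the dashboard does not state how its own value was derived.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the neuron fires.
    The zero threshold is a hypothetical default, not taken from the dashboard.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 6 of which fire.
sample = [0.0] * 9994 + [1.3, 0.2, 4.1, 0.7, 2.2, 0.9]
density = activation_density(sample)
print(f"Act Density {density:.3%}")
```

    A density well below 1% is consistent with a neuron that fires only on a narrow trigger, such as a single speaker token.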

    No Known Activations