    Explanations

    This neuron activates on tokens in the assistant’s generated reply (distinguishing model output text from user input).

    Negative Logits
    "щини"          -0.07
    "istry"         -0.06
    " bombers"      -0.06
    "Icon"          -0.06
    " CONTROL"      -0.06
    "_changes"      -0.06
    "DataStream"    -0.06
    " soap"         -0.06
    "Experts"       -0.06
    " اليمن"        -0.06
    Positive Logits
    " پار"          0.06
    " imperative"   0.06
    " Rex"          0.06
    " село"         0.06
    "нивер"         0.06
    "рий"           0.05
    "ウォ"           0.05
    (whitespace-only token)  0.05
    "-floating"     0.05
    "نية"           0.05
    Activation Density: 0.029%

    No Known Activations