INDEX
    Explanations

    The neuron fires on single-word affirmative or confirmation tokens (e.g. “Yes,” “true,” “Exactly”).

    Negative Logits
    these             -0.06
     disagreements    -0.06
                      -0.06
    Some              -0.06
     linewidth        -0.06
     Sala             -0.06
    Senate            -0.06
     "'.$             -0.06
     Equipment        -0.06
     spectro          -0.06
    Positive Logits
     tidak            0.07
    (if               0.07
     hely             0.06
    はい              0.06
     tüm              0.06
     ruining          0.06
    Việc              0.06
     Fehler           0.06
     fringe           0.06
    adress            0.06
    Activation Density: 0.028%

    No Known Activations