INDEX
    Explanations

    programming code

    This neuron selectively spikes on isolated single-letter sub‐tokens (particularly the lone “r” token).

    Negative Logits
     δη             -0.06
    ANCE            -0.06
    _devices        -0.06
     wi             -0.06
    photos          -0.06
     turno          -0.06
     варі           -0.06
     barber         -0.06
    _WS             -0.06
    тою             -0.06

    Positive Logits
     disgusted      0.06
    Rotate          0.06
     irritation     0.06
    自己            0.06
     hydraulic      0.06
     Duration       0.06
     Packers        0.06
     Cookie         0.06
     underground    0.06
    ")));↵          0.06

    Act Density 0.118%

    No Known Activations