INDEX
    Explanations

    Code-related text

    This neuron fires on the explicit answer‐choice labels “positive” and “negative.”

    Negative Logits
    Token            Logit
    ".Mouse"         -0.08
    "iagnostics"     -0.07
    "ROM"            -0.07
    "علام"           -0.06
    " =========="    -0.06
    "(Constructor"   -0.06
    "_write"         -0.06
    " khoa"          -0.06
    "_iso"           -0.06
    " COD"           -0.06
    Positive Logits
    Token            Logit
    " Clara"         0.06
    " learned"       0.06
    " rewarded"      0.06
    "\tar"           0.06
    "PUT"            0.06
    " Inches"        0.06
    " reported"      0.06
    " doubts"        0.05
    " addiction"     0.05
    "Sim"            0.05
    Activation Density 0.007%

    No Known Activations