INDEX
    Explanations

    This neuron fires on system-style or meta instructions, for example the word “revert” together with accompanying tokens such as “with” and quotation marks.

    Negative Logits
    )");↵↵          -0.07
     doctoral       -0.07
    オン            -0.07
    oltage          -0.06
     Katz           -0.06
     circumcision   -0.06
     NA             -0.06
     traf           -0.06
    .staff          -0.06
    "):↵            -0.06
    Positive Logits
     revert         0.11
     reverted       0.10
    ocu             0.07
    verting         0.07
    чивается        0.07
     rever          0.07
    going           0.07
     persisted      0.07
    :white          0.07
     abide          0.07
    Act Density 0.002%
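
    As a rough illustration of what a density figure like this means, the sketch below assumes that activation density is the percentage of sampled tokens on which the neuron's activation is above zero. That definition, the function name activation_density, and the sample data are assumptions made for illustration; they are not taken from this page or from any particular tool.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a nonzero activation.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 100,000 tokens, of which only 2 activate the neuron.
acts = [0.0] * 100_000
acts[10] = 0.8
acts[500] = 1.3

density = activation_density(acts)
print(f"{density:.3f}%")
```

    Under that assumption, a density of 0.002% corresponds to about 2 active tokens per 100,000 sampled.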

    No Known Activations