INDEX
    Explanations

    Mentioned in the text

    This neuron detects section-header phrases that ask the reader to “list the organs mentioned” (i.e., instruction headings).

    Negative Logits
     доб        -0.07
     Volk       -0.07
     سبک        -0.07
    862         -0.07
     Karlov     -0.07
     Mickey     -0.06
     sexes      -0.06
     tricky     -0.06
     SA         -0.06
    InOut       -0.06
    Positive Logits
    uyện        0.06
    简单        0.06
     όπου       0.06
    .strip      0.06
     rehabilit  0.06
    ._          0.06
    ];↵         0.06
    _flight     0.06
     unpaid     0.06
    登場        0.06
    Act Density 0.014%

    No Known Activations