    Explanations

    Categories/Metadata

    The neuron strongly activates on tokens that appear as Wikipedia category labels (the “Category:” lines at the end of an article).

    Negative Logits
    Token            Logit
    "807"            -0.06
    " пись"          -0.06
    " von"           -0.06
    " entropy"       -0.06
    " сл"            -0.06
    "423"            -0.06
    "』("            -0.06
    " Angiosper"     -0.06
    " سیاست"         -0.06
    " Сем"           -0.06
    Positive Logits
    Token            Logit
    (whitespace)     0.07
    "lal"            0.07
    " anv"           0.06
    " AND"           0.06
    "eval"           0.06
    "uteč"           0.06
    "={}"            0.06
    "↵    ↵    ↵"    0.06
    ".Program"       0.06
    "ocide"          0.06
    Act Density 0.024%
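
As a rough illustration of what a figure like this could mean, the sketch below computes an activation density as the fraction of tokens on which a neuron's activation exceeds a threshold. This definition, the function name `activation_density`, the zero threshold, and the sample data are all assumptions made for illustration; the page itself does not say how the value was computed.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumed definition, for illustration only: density = active tokens / all tokens.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 tokens, of which 24 activate the neuron.
sample = [0.0] * 99_976 + [1.5] * 24
density = activation_density(sample)
print(f"Act Density {density:.3%}")
```

Under this reading, a density of 0.024% corresponds to roughly 24 active tokens per 100,000 sampled.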

    No Known Activations