    Explanations

    The neuron fires on mentions of “pre-trained” (and related training terminology) in the context of language-model fine-tuning.

    Negative Logits
         Successful    -0.07
        Outcome        -0.06
        osloven        -0.06
        /logger        -0.06
                       -0.06
         Finals        -0.06
        LOOD           -0.06
         meaningful    -0.06
         ML            -0.06
         boils         -0.06

    Positive Logits
         innate         0.07
        Extra           0.06
         çift           0.06
         noh            0.06
         lear           0.06
         amat           0.06
         Pete           0.06
         quyết          0.06
                        0.06
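Logit lists like the ones above are typically produced by projecting the feature's output (decoder) direction through the model's unembedding matrix and ranking the resulting per-token scores. A minimal sketch, assuming random placeholder weights; `w_dec`, `W_U`, and the shapes are stand-ins, not values from this dashboard:

```python
import numpy as np

# Hypothetical stand-ins: a feature's decoder direction w_dec
# (shape: d_model) and the model's unembedding matrix W_U
# (shape: d_model x vocab_size). Real values come from a trained model.
rng = np.random.default_rng(0)
d_model, vocab_size = 64, 1000
w_dec = rng.normal(size=d_model)
W_U = rng.normal(size=(d_model, vocab_size))

# One logit-effect score per vocabulary token.
logit_effects = w_dec @ W_U  # shape: (vocab_size,)

# The ten most positive and ten most negative logit tokens,
# mirroring the two lists shown on the card.
k = 10
top_pos = np.argsort(logit_effects)[::-1][:k]
top_neg = np.argsort(logit_effects)[:k]
```

Mapping `top_pos`/`top_neg` indices back through the tokenizer's vocabulary yields the token strings displayed in the lists.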
    Activation Density 0.003%
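Activation density is the fraction of evaluated tokens on which the feature's activation is nonzero. A minimal sketch, assuming a flat array of activations over a token stream; the array and firing count are illustrative, not this feature's data:

```python
import numpy as np

# Hypothetical activations of the feature over 100k tokens;
# real dashboards compute this over a large evaluation corpus.
acts = np.zeros(100_000)
fire_idx = np.random.default_rng(1).choice(100_000, size=3, replace=False)
acts[fire_idx] = 1.0

# Density: fraction of tokens on which the feature fires.
density = (acts > 0).mean()
print(f"{density:.3%}")  # 3 firing tokens out of 100k -> 0.003%
```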

    No Known Activations