    Explanations

    This neuron detects mentions of training language models (e.g. “train a model,” “training a language model”).

    Negative Logits
    Token           Logit
    "ova"           -0.07
    " nearer"       -0.07
    " كتب"          -0.07
    "Areas"         -0.07
    "Activities"    -0.06
    " Jean"         -0.06
    " voor"         -0.06
    "rian"          -0.06
    " stools"       -0.06
                    -0.06

    Positive Logits
    Token           Logit
    "_mono"         0.07
    "emouth"        0.07
    " FS"           0.06
    " هواپیم"       0.06
    "tolist"        0.06
    " bedtime"      0.06
    " <>↵"          0.06
    " #-}↵↵"        0.06
    "&);↵↵"         0.06
    ")x"            0.06

    Activation Density: 0.029%

    No Known Activations