Explanations
articles
This neuron detects mentions of training language models (e.g. “train a model,” “training a language model”).
Negative Logits
ova         -0.07
nearer      -0.07
كتب         -0.07
Areas       -0.07
Activities  -0.06
Jean        -0.06
voor        -0.06
rian        -0.06
stools      -0.06
忙          -0.06
Positive Logits
_mono       0.07
emouth      0.07
FS          0.06
هواپیم      0.06
tolist      0.06
bedtime     0.06
<>↵         0.06
#-}↵↵       0.06
&);↵↵       0.06
)x          0.06
Activations Density: 0.029%