INDEX
Explanations
The neuron fires on mentions of “pre-trained” (and related training terminology) in the context of language-model fine-tuning.
Negative Logits
Successful    -0.07
Outcome       -0.06
osloven       -0.06
/logger       -0.06
输            -0.06
Finals        -0.06
LOOD          -0.06
meaningful    -0.06
ML            -0.06
boils         -0.06
Positive Logits
�             0.07
innate        0.06
Extra         0.06
çift          0.06
noh           0.06
lear          0.06
amat          0.06
Pete          0.06
quyết         0.06
că            0.06
Activations Density 0.003%