Explanations
This neuron essentially never activates except on the auxiliary verb “have,” indicating it’s detecting that specific word.
Negative Logits
Token        Logit
renewable    -0.07
unsafe       -0.07
:expr        -0.07
>");↵        -0.07
punishable   -0.06
(mem         -0.06
-Speed       -0.06
Atomic       -0.06
emergency    -0.06
MPL          -0.06
Positive Logits
Token        Logit
have         0.10
had          0.07
fusion       0.07
Στα          0.07
дома         0.07
khách        0.07
_gradients   0.07
ภายใน        0.06
seafood      0.06
have         0.06
Activations Density: 0.036%
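
A density figure like this can be read as the share of sampled tokens on which the neuron fires at all. The sketch below shows one way such a number might be computed. The function name `activation_density`, the zero threshold, and the sample data are all assumptions made for illustration; the page does not state how its density is defined or how many tokens were sampled.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of activations strictly above `threshold`.

    Assumption: a token counts as "active" when its activation exceeds
    the threshold (0.0 by default). The dashboard may define this differently.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 9 active tokens out of 25,000 (not data from the page).
sample = [0.0] * 24_991 + [1.5] * 9
density = activation_density(sample)
print(f"{density:.3f}%")
```

With these made-up counts, 9 active tokens in 25,000 gives the same 0.036% shown above, which illustrates how sparse the neuron is: roughly one active token in every 2,800.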