INDEX
Explanations
The neuron is essentially “dead” in these examples: it never activates on any token, so it is not detecting any pattern.
Negative Logits
Token          Logit
(permission    -0.07
Jon            -0.07
Jon            -0.07
Rocket         -0.07
Hakk           -0.07
Collector      -0.07
hayır          -0.07
333            -0.07
produkt        -0.07
ahir           -0.06
Positive Logits
Token          Logit
Sarah          0.08
Moodle         0.07
sitting        0.07
stew           0.07
desk           0.07
Cristina       0.07
لل             0.07
Sarah          0.06
intoler        0.06
stressful      0.06
Activations Density 0.006%
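
To make the density figure concrete, here is a minimal sketch of how such a number could be computed. It assumes that activation density means the fraction of sampled tokens on which the neuron's activation is above zero; the page does not state its exact definition. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of tokens with a non-zero
    activation. The page itself does not define the term.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 6 active tokens in a sample of 100,000 tokens.
acts = [0.0] * 99_994 + [1.5] * 6
density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.006%
```

Under this assumed definition, a density of 0.006% corresponds to roughly 6 active tokens per 100,000. That is consistent with a neuron that looks dead on a small set of examples while still firing very rarely across a larger sample.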