Explanations
programming code
This neuron fires selectively on isolated single-letter sub-tokens, particularly the lone “r” token.
Negative Logits
δη          -0.06
ANCE        -0.06
_devices    -0.06
wi          -0.06
photos      -0.06
turno       -0.06
варі        -0.06
barber      -0.06
_WS         -0.06
тою         -0.06
Positive Logits
disgusted     0.06
Rotate        0.06
irritation    0.06
自己          0.06
hydraulic     0.06
Duration      0.06
Packers       0.06
Cookie        0.06
underground   0.06
")));↵        0.06
Activations Density 0.118%
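
The page does not say how the density figure is computed. A minimal sketch, assuming it is the fraction of token positions on which the neuron's activation is above zero, might look like the following. The function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions where the neuron fires at all.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical toy data: 1,000 token positions, exactly one of them active.
sample = [0.0] * 999 + [2.5]
print(f"{activation_density(sample):.3%}")  # prints 0.100%
```

Under that reading, a density of 0.118% would mean the neuron is active on roughly 1 in 850 token positions, which fits a neuron that responds to a narrow class of tokens.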