Explanations
The neuron fires on any token containing “wheel” (e.g. wheel, wheels, wheelchair), effectively detecting mentions of wheel-related terms.
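As a rough sanity check, the explanation above can be written as a simple substring predicate and applied to the tokens listed on this page under positive logits. This is only a sketch: the function name `matches_explanation` and the case-insensitive match are assumptions made for illustration, and logit tokens describe what the neuron promotes in the output rather than what it activates on, so agreement here is suggestive at best.

```python
def matches_explanation(token: str) -> bool:
    """Hypothetical predicate: True if the token contains "wheel".

    Case-insensitive matching is an assumption, not something the page states.
    """
    return "wheel" in token.lower()


# Tokens copied from the positive-logits list on this page.
positive_tokens = [
    "wheel", "Wheel", "Wheel", "wheel", "Wheeler",
    "wheels", "Wheels", "heel", "wheelchair", "EL",
]

matched = [t for t in positive_tokens if matches_explanation(t)]
unmatched = [t for t in positive_tokens if not matches_explanation(t)]

print(f"{len(matched)} of {len(positive_tokens)} tokens contain 'wheel'")
print("not covered by the explanation:", unmatched)
```

The two tokens that fall outside the predicate ("heel" and "EL") are fragments of the same word, which is consistent with the explanation even though they do not literally contain "wheel".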
Negative Logits
']); ↵      -0.07
097         -0.07
_supp       -0.07
cstdio      -0.06
publik      -0.06
spa         -0.06
Tart        -0.06
_study      -0.06
osp         -0.06
smb         -0.06
Positive Logits
wheel       0.15
Wheel       0.14
Wheel       0.14
wheel       0.13
Wheeler     0.12
wheels      0.12
Wheels      0.09
heel        0.09
wheelchair  0.09
EL          0.09
Activations Density 0.010%
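To give the density figure some scale, the short sketch below converts the percentage into an expected count of activating tokens. It assumes that "Activations Density" means the fraction of tokens on which the neuron has a nonzero activation, which the page does not state; the helper name `expected_active_tokens` is hypothetical.

```python
# Density as reported on this page, in percent.
density_percent = 0.010

# Convert the percentage to a fraction (assumed: share of tokens with nonzero activation).
density = density_percent / 100.0


def expected_active_tokens(n_tokens: int) -> float:
    """Expected number of tokens on which the neuron fires, under the assumption above."""
    return n_tokens * density


print(expected_active_tokens(1_000_000))
```

Under that reading, the neuron fires on roughly one token in ten thousand, which fits a detector for a single word family.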