Explanations
The neuron activates specifically on words containing the root “trap” (e.g., trap, trapping, trapper, trapdoors, traps).
Negative Logits
Miche       -0.07
Cole        -0.07
这样的      -0.06
oud         -0.06
ied         -0.06
cole        -0.06
شهرد        -0.06
mileage     -0.06
Cleveland   -0.06
date        -0.06
Positive Logits
Trap        0.15
trap        0.13
trapped     0.12
traps       0.11
Trap        0.10
trap        0.09
trapping    0.09
陷          0.07
rap         0.07
strap       0.07
Activations Density 0.005%