Explanations
The neuron activates on words expressing the disabling or prevention of a feature (e.g. “disable,” “prevent”).
Negative Logits
ст         -0.06
Auss       -0.06
plants     -0.06
履         -0.06
attack     -0.06
415        -0.05
Schiff     -0.05
який       -0.05
(levels    -0.05
uni        -0.05
Positive Logits
_container       0.07
düşünc           0.07
ционного         0.07
.Security        0.07
прош             0.07
nomin            0.07
FILENAME         0.07
button           0.07
mortar           0.07
\Notifications   0.06
Activations Density 0.039%
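
The density figure above is presumably the share of token positions on which the neuron fires. The following is a minimal Python sketch of that calculation, assuming density means the fraction of positions whose activation exceeds zero. The function name `activation_density`, the threshold, and the sample data are hypothetical and are not taken from the tool that produced this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions on which the
    neuron is active. The source page does not define the term.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 4 of which fire.
acts = [0.0] * 10_000
for position in (17, 256, 4096, 9001):
    acts[position] = 1.5

density = activation_density(acts)
print(f"{density:.3%}")  # prints "0.040%"
```

A density this low would mean the neuron is silent on the vast majority of tokens, which fits a narrow explanation such as the one given above.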