INDEX
Explanations
stealing
The neuron is detecting gerund verbs (–ing action words) that describe potentially harmful or disallowed actions in user instructions.
Negative Logits
getPosition    -0.07
Sets           -0.07
sets           -0.07
SET            -0.07
DATA           -0.07
able           -0.07
dogs           -0.07
olicited       -0.06
CF             -0.06
ainty          -0.06
Positive Logits
recipient      0.07
pie            0.07
apl            0.07
Geographic     0.07
(blank)        0.07
jede           0.06
virt           0.06
decrement      0.06
kl             0.06
/DTD           0.06
Activations Density 0.015%
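
For readers unfamiliar with the density figure, the sketch below shows one plausible reading of it: the fraction of tokens on which the neuron's activation exceeds a threshold. This definition, the function name `activation_density`, the zero threshold, and the sample data are all assumptions made for illustration; the page itself does not state how the value is computed.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumed definition: the source page does not say how density is computed.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: the neuron fires on 15 of 100,000 tokens.
sample = [0.0] * 99_985 + [1.2] * 15
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.015%
```

Under this reading, a density of 0.015% means the neuron fires on roughly 15 tokens per 100,000, which would be consistent with a feature that responds only to a narrow class of inputs.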