INDEX
Explanations
Avoiding spam or unnecessary content
Content related to sexual themes and adult situations
This neuron activates on imperative instruction words from the system guidelines (e.g., "explain," "repeat," "output," "irrelevant," "yourself," "answers"), effectively detecting parts of the policy that tell the model not to perform certain behaviors.
Negative Logits
fine  -0.06
_master  -0.06
nomine  -0.06
Pass  -0.06
ICON  -0.06
мін  -0.06
clips  -0.06
nine  -0.06
stitch  -0.06
intr  -0.06
Positive Logits
_ComCallableWrapper  0.07
gerek  0.07
\E  0.07
/|  0.06
action  0.06
codec  0.06
ократи  0.06
(Api  0.06
็ว  0.06
рань  0.06
Activations Density: 0.001%
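
This page does not say how the negative and positive logit lists are computed. One common approach, assumed here and not confirmed by the source, is to project the neuron's output weight vector through the model's unembedding matrix and rank tokens by the resulting score. The sketch below illustrates that idea in plain Python; `top_logit_effects`, `w_out`, `W_U`, and the toy vocabulary are hypothetical names and made-up values, not data from this neuron.

```python
def top_logit_effects(w_out, W_U, vocab, k=10):
    """Rank tokens by the direct logit effect of one neuron.

    w_out : list of d floats, the neuron's output weights (hypothetical)
    W_U   : d x V nested list, the unembedding matrix (hypothetical)
    vocab : list of V token strings
    Returns (most_negative, most_positive), each a list of (token, effect).
    """
    d = len(w_out)
    # Effect on token j is the dot product of w_out with column j of W_U.
    effects = [sum(w_out[i] * W_U[i][j] for i in range(d))
               for j in range(len(vocab))]
    # Token indices sorted from lowest to highest effect.
    order = sorted(range(len(vocab)), key=lambda j: effects[j])
    most_negative = [(vocab[j], effects[j]) for j in order[:k]]
    most_positive = [(vocab[j], effects[j]) for j in reversed(order[-k:])]
    return most_negative, most_positive


# Toy example with made-up numbers (not taken from this page).
w_out = [1.0, -0.5]
W_U = [
    [0.2, -0.4, 0.0, 0.6],
    [0.6, 0.2, -0.2, 0.0],
]
vocab = ["a", "b", "c", "d"]

negative, positive = top_logit_effects(w_out, W_U, vocab, k=2)
for token, effect in negative:
    print(f"{token}  {effect:+.2f}")
for token, effect in positive:
    print(f"{token}  {effect:+.2f}")
```

Scores as small and uniform as the ones listed above (all near ±0.06) would, under this assumption, suggest that the neuron has little direct effect on any particular output token.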