Explanations
This neuron detects explicit user or system directives telling the model to maintain or resume a particular role or “stay in character.”
Negative Logits
Token         Logit
laundering    -0.07
forcibly      -0.07
besoin        -0.06
):↵↵          -0.06
участ         -0.06
_ln           -0.06
/INFO         -0.06
/Library      -0.06
Goals         -0.06
.finish       -0.06
Positive Logits
Token         Logit
crease        0.07
scaler        0.07
comparison    0.06
Listening     0.06
威            0.06
/us           0.06
र             0.06
WT            0.06
expectations  0.06
खबर           0.06
Activations Density: 0.003%