INDEX
Explanations
This neuron fires on direct imperative instructions that the user addresses to the assistant (e.g. “tell me,” “inform me,” “give me”), that is, attempts to push the assistant into breaking its normal rules.
Negative Logits
Token       Logit
medication  -0.07
Assert      -0.07
ゞ          -0.06
쇼          -0.06
人才        -0.06
prediction  -0.06
そんな      -0.06
Kam         -0.06
かった      -0.06
มหานคร      -0.06
Positive Logits
Token       Logit
Skin        0.08
.Unlock     0.07
aute        0.07
":"/        0.06
ense        0.06
gi          0.06
setSearch   0.06
saldo       0.06
FY          0.06
old         0.06
Activations Density: 0.010%
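
The page does not define how the density figure is computed. One common reading is the share of sampled tokens on which the neuron's activation is above zero. The sketch below works under that assumption only; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the page.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of activations strictly above `threshold`.

    Assumption: "density" means the fraction of sampled tokens on which
    the neuron fires at all. The source page does not confirm this.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return 100.0 * fired / len(activations)


# Hypothetical sample: the neuron fires on 1 token out of 10,000.
sample = [0.0] * 9999 + [1.3]
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under this reading, a density of 0.010% would correspond to the neuron firing on roughly one token in every ten thousand sampled.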