Explanations
This neuron detects text that asks the model to assume or play a “role” (i.e., explicit role-playing instructions).
Negative Logits
Kết            -0.06
ECT            -0.06
的に           -0.06
minute         -0.06
Dickinson      -0.06
Leopard        -0.06
Bytes          -0.06
نوشته          -0.06
interruption   -0.06
NPR            -0.06

Positive Logits
(assert        0.06
()==           0.06
ín             0.06
oppable        0.06
=*             0.06
wasn           0.06
ASP            0.06
=l             0.06
==(            0.06
.setResult     0.06
Activations Density: 0.017%