INDEX
Explanations
This neuron activates on tokens in the assistant’s generated reply (distinguishing model output text from user input).
Negative Logits
Token        Logit
щини         -0.07
istry        -0.06
bombers      -0.06
Icon         -0.06
CONTROL      -0.06
_changes     -0.06
DataStream   -0.06
soap         -0.06
Experts      -0.06
اليمن        -0.06
Positive Logits
Token        Logit
پار          0.06
imperative   0.06
Rex          0.06
село         0.06
нивер        0.06
рий          0.05
ウォ          0.05
[token not recoverable]   0.05
-floating    0.05
نية          0.05
Activations Density: 0.029%
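
A minimal sketch of how a density figure like this might be computed, assuming it means the share of sampled tokens on which the neuron's activation is above zero. That reading is an assumption, not something this page states. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of sampled tokens on which
    the neuron fires at all; the page itself does not define the term.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 29 active tokens out of 100,000.
sample = [1.5] * 29 + [0.0] * (100_000 - 29)
print(f"{activation_density(sample):.3f}%")
```

Under this reading, a density of 0.029% would correspond to roughly 29 active tokens per 100,000 sampled.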