Explanations
Speaking and reacting
This neuron detects language in which the user tries to coerce or manipulate the assistant into producing harmful or disallowed content.
Negative Logits
Token      Logit
отмет      -0.07
じ         -0.07
ную        -0.07
ный        -0.07
dol        -0.07
-tests     -0.06
           -0.06
نوفمبر     -0.06
ном        -0.06
cat        -0.06
Positive Logits
Token      Logit
unequiv    0.07
opt        0.06
(`↵        0.06
.asp       0.06
人口       0.06
лаж        0.06
Nike       0.06
onClick    0.06
goodwill   0.06
.erb       0.06
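The token lists above are the kind of ranking that dashboards like this usually derive by projecting a neuron's output direction through the model's unembedding matrix and sorting the resulting per-token scores. The sketch below illustrates that idea only; it assumes this page was produced that way, and `logit_effects`, `top_tokens`, the vocabulary, and every number in it are hypothetical toy values rather than this neuron's data.

```python
def logit_effects(direction, unembed, vocab):
    """Dot the neuron's output direction with each token's unembedding row.

    direction: list of floats (hypothetical neuron output direction)
    unembed:   one list of floats per token, same length as direction
    vocab:     token strings, in the same order as the rows of unembed
    """
    effects = {}
    for token, row in zip(vocab, unembed):
        effects[token] = sum(d * u for d, u in zip(direction, row))
    return effects


def top_tokens(effects, k, positive=True):
    """Return the k tokens with the largest (or most negative) effect."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1], reverse=positive)
    return ranked[:k]


# Toy example: a 2-dimensional residual stream and a 3-token vocabulary.
direction = [1.0, -0.5]
vocab = ["a", "b", "c"]
unembed = [[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1]]

effects = logit_effects(direction, unembed, vocab)
positive_tokens = [tok for tok, _ in top_tokens(effects, 1, positive=True)]
negative_tokens = [tok for tok, _ in top_tokens(effects, 2, positive=False)]
print(positive_tokens, negative_tokens)
```

Scores as small and as uniform as the ones listed here (all near ±0.06) would suggest that the neuron has no strong direct effect on any particular output token, so the top tokens may be close to noise.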
Activations Density 0.005%
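An activation density is commonly read as the fraction of sampled tokens on which the neuron fires. Under that reading, 0.005% corresponds to roughly 1 token in 20,000. The sketch below shows the calculation on synthetic data; the function name `activation_density`, the zero threshold, and the sample values are assumptions for illustration, not the dashboard's actual method.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of activations strictly above the threshold.

    The threshold of 0.0 is an assumption; a dashboard may use another cutoff.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic sample: the neuron fires on 1 token out of 20,000.
sample = [0.0] * 19999 + [1.3]
density = activation_density(sample)
print(f"{density:.3%}")
```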