Explanations
The neuron fires on tokens where the assistant refuses a request or expresses inability (e.g. “I’m sorry,” “cannot,” “unable,” “decline”); in short, it detects refusal-style language.
Negative Logits
Token                          Logit
guarante                       -0.06
tl                             -0.06
.drive                         -0.06
ru                             -0.06
cerr                           -0.06
")↵                            -0.06
antity                         -0.06
.RIGHT                         -0.06
it                             -0.06
(empty or whitespace token)    -0.06
Positive Logits
Token                          Logit
Александ                       0.07
Nicholas                       0.07
/forum                         0.07
관리자                         0.06
Sek                            0.06
larla                          0.06
Серг                           0.06
sek                            0.06
.С                             0.06
.wik                           0.06
Activations Density 0.014%
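
As a rough illustration of how a density figure like the one above can be read, the sketch below computes the fraction of tokens on which a neuron's activation exceeds a threshold. The source does not define the metric, so this definition is an assumption, as are the function name `activation_density`, the zero threshold, and the sample data (which is made up to reproduce the 0.014% figure).

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    neuron is active; the source page does not state its exact definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 tokens, 14 of which activate the neuron.
sample = [0.0] * 99_986 + [1.5] * 14
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.014%
```

A density this low means the neuron is active on roughly 1 token in 7,000, which is typical of a sparse, narrowly selective unit.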