Explanations
Declining to comment
The neuron detects phrases in which the assistant is refusing a request (e.g. “I’m sorry but I cannot fulfill this request”).
Negative Logits
Associ  -0.06
限定  -0.06
thông  -0.06
filePath  -0.06
Spi  -0.06
日期  -0.06
Зак  -0.06
.goto  -0.06
모  -0.06
char  -0.06
Positive Logits
Reviews  0.07
--↵  0.07
Outdoor  0.06
.done  0.06
Death  0.06
sacked  0.06
Measured  0.06
lene  0.06
ROME  0.06
突然  0.06
Activations Density 0.022%
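
The density figure can be read as the share of tokens on which the neuron fires at all. The page does not say how it was computed, so the following is only a minimal sketch, assuming density means the percentage of sampled tokens whose activation exceeds a threshold of zero. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the fraction of sampled tokens on which the
    neuron is active, expressed as a percentage.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical toy data: 10,000 tokens, only 2 of which activate the neuron.
toy = [0.0] * 9998 + [1.3, 0.4]
print(f"{activation_density(toy):.3f}%")
```

Under this reading, a density of 0.022% would mean that roughly 2 in every 10,000 sampled tokens activate the neuron, which fits a detector for a narrow pattern such as refusal phrases.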