Explanations
code fragments
The neuron detects the assistant’s refusal or apology statements when declining inappropriate or disallowed requests.
Negative Logits
unresolved      -0.07
服务器          -0.06
оряд            -0.06
一度            -0.06
_LCD            -0.06
effortlessly    -0.06
stripes         -0.06
xung            -0.06
(src            -0.06
ublik           -0.06
Positive Logits
cantidad        0.07
.Floor          0.07
getP            0.06
Crush           0.06
').'</          0.06
ルの            0.06
еления          0.06
.Override       0.06
alse            0.06
heal            0.06
Activations Density: 0.027%