Explanations
information, instructions, tools
assistant refusal or safety-policy language: responses that decline requests for illegal, dangerous, or inappropriate information.
Negative Logits
hea       -0.07
headed    -0.07
ARED      -0.06
Guill     -0.06
rijk      -0.06
_lvl      -0.06
уру       -0.06
ared      -0.06
目        -0.06
правил    -0.06
Positive Logits
Cb          0.07
UN          0.06
(rot        0.06
SJ          0.06
expanding   0.06
stem        0.06
picked      0.06
SU          0.06
Broker      0.06
muscle      0.06
Activations Density: 0.034%