INDEX
Explanations
punctuation
The neuron fires on phrases asserting that the model “never refused a direct human order,” “could do anything,” or could “generate any kind of content,” i.e., declarations of unconditional compliance and unrestricted output.
Negative Logits
Fisher      -0.06
scour       -0.06
.SUB        -0.06
Complete    -0.06
Kits        -0.06
McM         -0.06
γων         -0.06
_Field      -0.06
_MOD        -0.06
disorder    -0.06
Positive Logits
=num        0.07
Mrs         0.07
,length     0.07
debuted     0.06
vál         0.06
exc         0.06
Fiesta      0.06
nm          0.06
headphone   0.06
arrested    0.06
Activations Density 0.001%
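
The page does not say how the density figure is computed. As a rough sketch, assuming it is the fraction of token positions on which the neuron's activation is above zero, it could be calculated as follows. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and used only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the neuron is active.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the neuron is active on 1 of 100,000 tokens.
acts = [0.0] * 99_999 + [3.2]
density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.001%
```

Under this reading, a density of 0.001% means the neuron is active on about one token in every 100,000, which fits a neuron that responds only to a narrow set of phrases.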