Explanations
explaining refusals and disclaimers
Negative Logits
】      0.97
];      0.93
/       0.90
);      0.88
\}      0.88
】,     0.87
');     0.86
\]      0.86
』      0.83
\[      0.82
Positive Logits
Interestingly   0.95
<unused940>     0.92
<unused1658>    0.87
Interestingly   0.86
As              0.83
Fortunately     0.82
Unlike          0.81
Thankfully      0.80
After           0.78
As              0.78
Activations Density 0.126%