Explanations

explaining refusals and disclaimers

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

"】"  0.97
"];"  0.93
0.90
");"  0.88
"\}"  0.88
" 】,"  0.87
"');"  0.86
"\]"  0.86
"』"  0.83
"\["  0.82

Positive Logits

" Interestingly"  0.95
"<unused940>"  0.92
"<unused1658>"  0.87
"Interestingly"  0.86
"As"  0.83
" Fortunately"  0.82
" Unlike"  0.81
" Thankfully"  0.80
"After"  0.78
"As"  0.78

Activations Density 0.126%