INDEX

Explanations

reasons for refusal

Instances where the model asserts safety constraints or refuses to comply, citing guidelines or its inability to fulfill harmful requests, or offering safe alternatives.

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

ेंगू    0.42
以便    0.40
 éventuellement    0.40
收集    0.38
 magnétique    0.38
用    0.38
 получать    0.37
有点    0.37
 Depending    0.37
 হালকা    0.37

Positive Logits

 refusal    0.96
 rejection    0.96
 رفض    0.91
 rejects    0.89
 reject    0.86
 rejected    0.86
 rifi    0.85
 refused    0.84
 rejecting    0.84
 refuses    0.82

Activations Density 0.645%
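
To give the density figure a sense of scale, the short sketch below combines it with the prompt counts listed under Configuration. It rests on two assumptions that the page does not state: that the density is the fraction of token positions where the feature has a nonzero activation, and that it was measured over the same 238,145 prompts of 512 tokens each. The function names are illustrative only.

```python
# Figures copied from the dashboard text.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
ACTIVATION_DENSITY_PERCENT = 0.645


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total token positions in the dashboard prompt set."""
    return num_prompts * tokens_per_prompt


def estimated_active_tokens(total: int, density_percent: float) -> int:
    """Approximate count of token positions where the feature fires.

    Assumption: density is the share of token positions with a nonzero
    activation, measured over this same prompt set.
    """
    return round(total * density_percent / 100)


total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
active = estimated_active_tokens(total, ACTIVATION_DENSITY_PERCENT)

print(f"total token positions: {total:,}")
print(f"estimated active positions: {active:,}")
```

Under these assumptions the prompt set holds about 122 million token positions, of which a density of 0.645% would correspond to several hundred thousand activating positions. If the density was computed on a different sample, the second figure does not apply.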