INDEX

Explanations

refusal, denied, ignored

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

Token              Value
" Faster"          0.42
" Context"         0.41
" Sexy"            0.40
"Spa"              0.38
" Optimal"         0.38
"важа"             0.38
" Falling"         0.38
" Bewer"           0.38
" Expectations"    0.37
"养"               0.37

Positive Logits

Token              Value
" আশ্বাস"          0.81
" refusal"         0.70
" refused"         0.69
" refus"           0.68
" refuses"         0.66
" refusing"        0.66
" shrugged"        0.66
" told"            0.64
"promised"         0.64
" refuse"          0.63

Activations Density 0.007%
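
For a rough sense of scale, the density figure can be combined with the prompt configuration listed under Configuration. The sketch below assumes (the page does not state this) that the 0.007% density is a per-token fraction measured over the same 238,145 prompts of 512 tokens each. The variable names are illustrative only.

```python
# Rough estimate only. Assumption (not stated in the source): the activation
# density is the fraction of tokens, across the dashboard prompt set, on which
# the feature activates.
num_prompts = 238_145        # from "Prompts (Dashboard)"
tokens_per_prompt = 512      # from "Prompts (Dashboard)"
density_percent = 0.007      # from "Activations Density"

# Total tokens in the prompt set.
total_tokens = num_prompts * tokens_per_prompt

# Convert the percentage to a fraction and scale by the token count.
estimated_active_tokens = total_tokens * density_percent / 100

print(f"total tokens: {total_tokens:,}")
print(f"estimated activating tokens: {estimated_active_tokens:,.0f}")
```

Under that assumption, the feature would fire on only a few thousand tokens out of more than a hundred million, which fits a sparse, narrowly scoped feature.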