Explanations

I understand limits

Sentences in which the model asserts its safety constraints and refuses or declines disallowed or explicit requests (e.g., "I am programmed to be a safe and helpful AI assistant" and similar refusal language).

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

Token (quoted to preserve leading spaces)    Value
" critters"      0.51
"gef"            0.46
" scallops"      0.44
" labs"          0.43
" gets"          0.43
" ज्यादातर"       0.43
" mayhem"        0.41
"篩"             0.41
" judgements"    0.40
" judgments"     0.40

Positive Logits

Token (quoted to preserve leading spaces)    Value
"Instead"        0.57
"我可以"          0.54
" попыта"        0.49
"Redirect"       0.48
"拒绝"           0.48
"redirect"       0.47
"incerely"       0.47
" Redirect"      0.47
" Instead"       0.46
" सकता"          0.46

Activations Density 0.702%
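
To give the density figure a sense of scale, the sketch below converts it into an approximate token count. It rests on two assumptions that the page does not state: that "Activations Density" means the fraction of tokens on which the feature fires at all, and that it was measured over the same 238,145 prompts of 512 tokens listed under Configuration. The constant and function names are illustrative only.

```python
# Figures copied from the dashboard. Treating them as describing the same
# token set is an assumption, not something the page confirms.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
ACTIVATION_DENSITY = 0.702 / 100  # 0.702% expressed as a fraction


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total number of token positions in the dashboard prompt set."""
    return num_prompts * tokens_per_prompt


def estimated_active_tokens(total: int, density: float) -> int:
    """Approximate count of token positions where the feature is non-zero."""
    return round(total * density)


total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
active = estimated_active_tokens(total, ACTIVATION_DENSITY)

print(f"total tokens:            {total:,}")
print(f"estimated active tokens: {active:,}")
```

Under those assumptions the feature would fire on fewer than one token in a hundred, which fits a feature tied to a specific behavior such as refusal phrasing rather than to general text.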