Explanations

cannot and will not fulfill requests

Instances of the model's safety/refusal boilerplate, especially the token "safe" in the phrase "I am programmed to be a safe and helpful AI assistant."

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

Token (quoted to show leading spaces)  Value
"ಜಿ"  0.40
" Kinetics"  0.40
"㭐"  0.40
" scripted"  0.39
" সিঁ"  0.39
"orealistic"  0.39
" kinetics"  0.38
"舗"  0.37
" Safer"  0.37
" safer"  0.36

Positive Logits

Token (quoted to show leading spaces)  Value
" BLUE"  0.49
"Blue"  0.45
" Blue"  0.43
" Gand"  0.40
" മഹാ"  0.39
"blu"  0.38
" martin"  0.38
"Santos"  0.37
" santos"  0.37
"BLUE"  0.36

Activations Density: 0.014%
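
To put the density figure in context, the short sketch below estimates how many token positions it corresponds to. It assumes, without confirmation from the dashboard itself, that the density is the fraction of token positions with a nonzero activation and that it was measured over the same 238,145 prompts of 512 tokens each listed under Configuration. The variable names are illustrative only.

```python
# Assumption: density = fraction of token positions where the feature fires,
# measured over the dashboard prompt set (not confirmed by the source).
num_prompts = 238_145        # from Configuration
tokens_per_prompt = 512      # from Configuration
density = 0.014 / 100        # 0.014% expressed as a fraction

total_tokens = num_prompts * tokens_per_prompt
approx_active_tokens = round(total_tokens * density)

print(f"total token positions: {total_tokens:,}")
print(f"approx. positions with nonzero activation: {approx_active_tokens:,}")
```

Under these assumptions the feature fires on roughly 17 thousand of about 122 million token positions, which is consistent with a narrow feature tied to a specific boilerplate phrase.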