INDEX

Explanations

strength, adaptability, preparedness

Language associated with AI safety-policy refusals and content moderation, including explanations of why a request is harmful or disallowed and redirections to safer alternatives or support resources.

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

ᔨ    0.47
ゝ    0.43
棸    0.42
 "../../    0.42
 والإ    0.39
 regain    0.39
 symplect    0.39
交易所    0.39
 rehabilitate    0.38
 rehabilitation    0.38

Positive Logits

INCRE    0.50
но    0.45
фект    0.42
orse    0.42
iny    0.41
jego    0.41
increased    0.41
новых    0.40
increases    0.39
 космо    0.39

Activations Density 10.297%