INDEX

Explanations

I am digging a hole

sentences where the assistant refuses a harmful request or provides safety/warning language (e.g., "I am programmed to be a safe/harmless AI" and related resources).

New Auto-Interp

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Embeds

IFrame

Link

Not in Any Lists

Negative Logits

pyridine

0.19

»),

0.18

 glycer

0.18

δ

0.18

 derajat

0.17

 حقیقت

0.17

}),

0.17

त्मक

0.17

“,

0.17

=',

0.17

POSITIVE LOGITS

 welkom

0.19

 कॅ

0.18

苏

0.18

মন্দ

0.17

 그냥

0.17

↵↵↵

0.17

पेट

0.17

 колдон

0.17

ール

0.17

トイレ

0.17

Activations Density 0.582%