INDEX

Explanations

jailbreak and boilerplate

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each

Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

Token             Logit
"OptionsMenu"     -0.11
"mast"            -0.09
" Interracial"    -0.08
"ipher"           -0.08
"hon"             -0.08
"rypton"          -0.08
".createServer"   -0.08
"ixer"            -0.08
"Kou"             -0.08
"╗"               -0.08

Positive Logits

Token             Logit
"-style"          0.09
" Joey"           0.09
"KIT"             0.08
" whistle"        0.08
"iesen"           0.08
"ám"              0.08
" ngũ"            0.08
"ajs"             0.08
"Wil"             0.08

Activations Density 0.242%
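
As a rough illustration of what the density figure could mean, here is a minimal sketch. It assumes that activation density is the fraction of token positions on which the feature activates above zero, and that it is measured over the dashboard prompts listed under Configuration; the page itself does not state either point. The function name `activation_density` and the toy data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    `activations` is a list of prompts, each a list of per-token activation
    values for one feature. This definition is an assumption, not taken
    from the dashboard.
    """
    total = 0
    active = 0
    for prompt in activations:
        for value in prompt:
            total += 1
            if value > threshold:
                active += 1
    return active / total if total else 0.0


# Toy data (invented): 4 prompts of 5 tokens, feature fires on 2 positions.
toy = [
    [0.0, 0.0, 1.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
density = activation_density(toy)

# Under the same assumption, the reported 0.242% over
# 10,000 prompts x 128 tokens implies roughly this many active positions.
total_tokens = 10_000 * 128
implied_active = round(total_tokens * 0.242 / 100)
```

If the assumption holds, a density of 0.242% means the feature is sparse: it is silent on well over 99% of token positions in the sampled prompts.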