INDEX

Explanations

provide harmful output

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each
Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

"apur"    -0.09
"rem"     -0.09
" ãĤ"     -0.08
"ãĢ"      -0.08
"te"      -0.08
"lew"     -0.08
" Ning"   -0.08
"Ung"     -0.08
"tut"     -0.08
"lab"     -0.08

Positive Logits

" hurt"       0.11
" anything"   0.10
" output"     0.10
" harm"       0.10
"ummer"       0.10
" Harm"       0.10
" Anything"   0.10
"ismet"       0.10
" produce"    0.09
"AI"          0.09

Activations Density: 0.047%
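
To give a sense of scale for the density figure, here is a rough back-of-the-envelope sketch in Python. It assumes (this is not stated on the page) that the density is the fraction of dashboard tokens on which the feature has a nonzero activation, and that it was measured over the 10,000 prompts of 128 tokens each listed under Configuration. The variable names are illustrative only.

```python
# Assumption: density = (tokens where the feature fires) / (all dashboard tokens).
# Assumption: the dashboard token pool is 10,000 prompts x 128 tokens.
n_prompts = 10_000
tokens_per_prompt = 128
density_percent = 0.047  # value shown on the dashboard

total_tokens = n_prompts * tokens_per_prompt
approx_active_tokens = round(total_tokens * density_percent / 100)

print(f"total tokens: {total_tokens}")
print(f"approx. tokens with nonzero activation: {approx_active_tokens}")
```

Under those assumptions the feature would fire on roughly 600 of 1.28 million tokens, which is consistent with a sparse, narrowly scoped feature.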