INDEX

Explanations

not factually coherent or does not make sense

Configuration

Prompts (Dashboard)

10,000 prompts, 128 tokens each

Dataset (Dashboard)

lmsys/lmsys-chat-1m

Negative Logits

Token      Logit
"فس"       -0.09
" Чи"      -0.09
"esso"     -0.08
" sodom"   -0.08
"arend"    -0.08
"kü"       -0.08
" Bast"    -0.08
"nze"      -0.08
"uisse"    -0.08
"nt"       -0.08

Positive Logits

Token        Logit
" cannot"    0.14
"cannot"     0.11
" outside"   0.10
" beyond"    0.10
" contain"   0.10
" Cannot"    0.09
" contains"  0.09
" seem"      0.09
"auen"       0.09
"AndWait"    0.09

Activations Density 0.013%
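
To put the density figure in context, the short sketch below estimates how many tokens in the dashboard prompt set the feature would fire on. It assumes that "Activations Density" means the fraction of tokens with a nonzero activation, and that it was measured over the 10,000 prompts of 128 tokens each listed under Configuration. Neither assumption is confirmed by this page, and the function name is hypothetical.

```python
def expected_active_tokens(n_prompts: int, tokens_per_prompt: int,
                           density_percent: float) -> tuple[int, float]:
    """Return (total tokens, estimated number of tokens where the feature fires).

    Assumption: density is the fraction of all tokens with nonzero activation.
    """
    total_tokens = n_prompts * tokens_per_prompt
    active = total_tokens * density_percent / 100.0
    return total_tokens, active


# Values taken from the Configuration and Activations Density entries.
total_tokens, expected_active = expected_active_tokens(10_000, 128, 0.013)
print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {expected_active:.0f}")
```

Under these assumptions the feature fires on roughly 166 of 1,280,000 tokens, which would make it a sparse feature.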