Explanations

graphic violence and sexuality

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each
Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

Token            Logit
"edException"    -0.09
"Viet"           -0.09
" convers"       -0.09
"RelativeTo"     -0.09
"ocide"          -0.09
"wizard"         -0.08
"ers"            -0.08
"cum"            -0.08

Positive Logits

Token            Logit
"ous"            0.12
"ence"           0.11
" femmes"        0.11
"ented"          0.10
"alia"           0.10
"encia"          0.10
" physical"      0.10
" streak"        0.09
"Vor"            0.09
"uous"           0.09

Activations Density 0.024%
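
As a rough illustration of what the density figure implies, the sketch below assumes that density means the fraction of token positions where the feature's activation is nonzero, and that it was measured over the dashboard prompts listed under Configuration. Neither assumption is confirmed by this page, and the helper `activation_density` is hypothetical.

```python
def activation_density(activations):
    """Fraction of token positions with a nonzero (positive) activation.

    `activations` is a list of prompts, each a list of per-token
    activation values. Hypothetical helper, for illustration only.
    """
    total = 0
    active = 0
    for prompt in activations:
        for value in prompt:
            total += 1
            if value > 0:
                active += 1
    return active / total if total else 0.0


# Sizes taken from the Configuration section.
n_prompts = 10_000
tokens_per_prompt = 128
total_tokens = n_prompts * tokens_per_prompt

# Reported density of 0.024%, converted to a fraction.
reported_density = 0.024 / 100

# Approximate number of token positions on which the feature fires,
# ASSUMING the density was computed over exactly these prompts.
implied_active = round(total_tokens * reported_density)
print(total_tokens, implied_active)
```

Under those assumptions, the feature would fire on roughly 300 of the 1.28 million token positions, which is consistent with a sparse, narrowly scoped feature.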