Explanations

warnings, dangers, caveats

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each

Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

Token               Logit
".poi"              -0.09
"panic"             -0.09
"ussed"             -0.09
" flush"            -0.09
":::::::::"         -0.08
"丈"                -0.08
"proof"             -0.08
"AllowAnonymous"    -0.08
"sto"               -0.08
" straight"         -0.08

Positive Logits

Token               Logit
" warning"          0.18
" warnings"         0.18
" disclaimer"       0.16
" warn"             0.14
" Warning"          0.13
"warnings"          0.13
" cave"             0.13
" boiler"           0.12
"warn"              0.11
" Cave"             0.11

Activations Density 0.057%
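
To put the density figure in context, the short sketch below estimates how many tokens the feature fires on. It assumes, and the page does not confirm this, that the density is measured over the dashboard prompt set listed under Configuration (10,000 prompts of 128 tokens each). The variable names are illustrative only.

```python
# Assumption: activation density = fraction of dashboard tokens on which
# the feature has a nonzero activation. The page does not state this.
NUM_PROMPTS = 10_000          # from "Prompts (Dashboard)"
TOKENS_PER_PROMPT = 128       # from "Prompts (Dashboard)"
DENSITY_PERCENT = 0.057       # from "Activations Density"

# Total number of token positions in the dashboard prompt set.
total_tokens = NUM_PROMPTS * TOKENS_PER_PROMPT

# Estimated number of token positions where the feature is active.
estimated_active_tokens = total_tokens * DENSITY_PERCENT / 100

print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {round(estimated_active_tokens):,}")
```

Under that assumption the feature is active on roughly 730 of 1,280,000 tokens, which is consistent with a sparse feature tied to a narrow concept such as warnings and caveats.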