Explanations

commands and requests

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each

Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

Token          Logit
" ours"        -0.10
"anan"         -0.10
" friendly"    -0.09
" Erotic"      -0.09
" Prostit"     -0.09
" Kiss"        -0.08
" kiss"        -0.08
" ｰ"           -0.08
" kissed"      -0.08
"illo"         -0.08

Positive Logits

Token            Logit
" command"       0.13
" orders"        0.12
"命令"           0.12
" requests"      0.12
" commands"      0.11
"orders"         0.11
"commands"       0.11
"/command"       0.11
" instruction"   0.11
" request"       0.10

Activations Density 0.054%