INDEX

Explanations

content rating for young readers

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each

Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

Token            Logit
" privacy"       -0.09
" abnormal"      -0.09
"soo"            -0.09
"odp"            -0.09
"eniz"           -0.09
"姑"             -0.09
" Privacy"       -0.08
" suspicious"    -0.08
"nection"        -0.08

Positive Logits

Token            Logit
"PG"             0.25
"PG"             0.23
" violence"      0.21
" rating"        0.21
" Mature"        0.20
" Violence"      0.20
" rated"         0.19
" mature"        0.18
"-rated"         0.18
" Parent"        0.17

Activations Density: 0.085%
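
To give the density figure a sense of scale, the short sketch below turns it into an approximate token count. It rests on two assumptions that the page does not state: that activation density means the fraction of tokens on which the feature has a non-zero activation, and that it was measured over the same 10,000 prompts of 128 tokens listed under Configuration. The function name `expected_active_tokens` is hypothetical and used only for illustration.

```python
# Values copied from the dashboard's Configuration and density lines.
NUM_PROMPTS = 10_000
TOKENS_PER_PROMPT = 128
DENSITY_PERCENT = 0.085


def expected_active_tokens(num_prompts, tokens_per_prompt, density_percent):
    """Return (total tokens, approximate number of tokens where the feature fires).

    Assumption: density is the share of tokens with a non-zero activation,
    measured over exactly this prompt set.
    """
    total = num_prompts * tokens_per_prompt
    active = round(total * density_percent / 100)
    return total, active


total_tokens, active_tokens = expected_active_tokens(
    NUM_PROMPTS, TOKENS_PER_PROMPT, DENSITY_PERCENT
)
print(f"{total_tokens:,} tokens in total, roughly {active_tokens:,} activating")
```

Under those assumptions the feature would fire on only about one token in every 1,200, which fits a narrow feature tied to content-rating vocabulary.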