INDEX

Explanations

harmful or negative comments/opinions

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each

Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

"¶Į"          -0.16
"UnderTest"   -0.11
"-scrollbar"  -0.11
"įng"         -0.11
"Â"           -0.11
"EMPLARY"     -0.10
" DÃ¼n"       -0.10
"ÐĵÐŀ"        -0.10
"ÂĢÂĢ"        -0.09
"ozÃŃ"        -0.09

Positive Logits

"(s"          0.11
"td"          0.10
" Oswald"     0.09
"ione"        0.09
"st"          0.09
"Something"   0.08
" EACH"       0.08
"å°½"         0.08
"set"         0.08
" Couch"      0.08

Activations Density 0.329%
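
To give the density figure a sense of scale, the arithmetic below estimates how many tokens the feature fired on. It is a rough sketch that assumes the density is the fraction of dashboard tokens with a non-zero activation, measured over the 10,000 prompts of 128 tokens listed under Configuration. The page does not state this definition, and the function name `estimated_active_tokens` is hypothetical.

```python
def estimated_active_tokens(num_prompts: int, tokens_per_prompt: int,
                            density_percent: float) -> int:
    """Estimate how many tokens activate the feature.

    Assumption (not stated on the page): density is the share of all
    dashboard tokens on which the feature has a non-zero activation.
    """
    total_tokens = num_prompts * tokens_per_prompt
    return round(total_tokens * density_percent / 100)


# Figures taken from the Configuration and Activations Density fields.
total = 10_000 * 128
active = estimated_active_tokens(10_000, 128, 0.329)
print(f"{active} of {total} tokens")
```

Under that assumption the feature would be active on roughly 4,200 of the 1.28 million dashboard tokens, which is consistent with a sparse feature.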