Explanations

harmful or negative traits

Configuration

Prompts (Dashboard)

10,000 prompts, 128 tokens each

Dataset (Dashboard)

lmsys/lmsys-chat-1m


Negative Logits

 olmak

-0.09

 ""},\n

-0.09

 〉

-0.09

 Bald

-0.08

eric

-0.08

乃

-0.08

 entirely

-0.08

丈

-0.08

amar

-0.08

ูท

-0.08

Positive Logits

nor

0.29

or

0.25

或

0.19

 veya

0.18

 или

0.18

nor

0.18

 hoặc

0.17

 atau

0.16

 یا

0.16

거나

0.15

Activations Density 0.101%
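
To give a rough sense of scale for the density figure, the sketch below converts it into an approximate count of active token positions. It assumes, without confirmation from this page, that density means the fraction of sampled token positions with a nonzero activation, and that it was measured over the 10,000 prompts of 128 tokens listed under Configuration. The function name is hypothetical.

```python
# Assumption: density = fraction of sampled token positions where the
# feature has a nonzero activation, measured over the dashboard prompts.
PROMPTS = 10_000
TOKENS_PER_PROMPT = 128
DENSITY = 0.101 / 100  # 0.101% expressed as a fraction


def expected_active_tokens(prompts: int, tokens_per_prompt: int, density: float) -> float:
    """Approximate number of token positions where the feature is active."""
    return prompts * tokens_per_prompt * density


total_tokens = PROMPTS * TOKENS_PER_PROMPT
active = expected_active_tokens(PROMPTS, TOKENS_PER_PROMPT, DENSITY)
print(f"{total_tokens:,} tokens sampled, roughly {active:,.0f} with a nonzero activation")
```

Under these assumptions the feature would be active on roughly one token in a thousand, which fits a feature tied to a narrow class of tokens such as the conjunctions listed under Positive Logits.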