Explanations

I cannot, I'm sorry, I must

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each
Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

Token           Logit
" guar"         -0.11
"arrant"        -0.10
" Richt"        -0.10
"echan"         -0.10
" invit"        -0.09
" Duch"         -0.09
"anal"          -0.09
" æ°"           -0.09
"invitation"    -0.08
"ayacak"        -0.08

Positive Logits

Token               Logit
" cannot"           0.11
"radu"              0.10
" responsibility"   0.10
" ethical"          0.10
"åĬ"                0.10
" duty"             0.10
" obligation"       0.10
" moral"            0.10
"ummer"             0.09
" shar"             0.09

Activations Density: 0.151%