Explanations

describing personality traits

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each

Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

Token        Logit
" Erotik"    -0.09
"sexy"       -0.09
" Pornhub"   -0.09
" Sexy"      -0.09
".synthetic" -0.08
"ｍ"         -0.08
"надлеж"     -0.08
" erót"      -0.08
" Hüs"       -0.08
"ayd"        -0.08

Positive Logits

Token        Logit
"Bis"        0.12
" willing"   0.11
"IQ"         0.11
"bis"        0.10
" somewhat"  0.10
" partial"   0.10
" poly"      0.09
" Narc"      0.09
"bis"        0.09
" manip"     0.09

Activations Density 0.084%
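
To put the density figure in context, the short Python sketch below estimates how many tokens in the dashboard prompt set the feature fires on. It assumes that the 0.084% density is measured over the same 10,000 prompts of 128 tokens each listed under Configuration; the dashboard does not state this explicitly, and the variable names are illustrative only.

```python
# Assumption (not confirmed by the dashboard): activation density is the
# fraction of tokens in the dashboard prompt set with a nonzero activation.
num_prompts = 10_000        # "10,000 prompts" from Configuration
tokens_per_prompt = 128     # "128 tokens each" from Configuration
density_percent = 0.084     # "Activations Density 0.084%"

total_tokens = num_prompts * tokens_per_prompt
estimated_active_tokens = total_tokens * density_percent / 100

print(f"total tokens: {total_tokens:,}")
print(f"estimated activating tokens: {estimated_active_tokens:,.0f}")
```

Under that assumption the feature is active on about one token in 1,200, which would make it a sparse feature.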