Explanations

authenticity versus fakeness

Configuration

Prompts (Dashboard)

10,000 prompts, 128 tokens each

Dataset (Dashboard)

lmsys/lmsys-chat-1m

Negative Logits

Token               Logit
"agu"               -0.09
" stranded"         -0.08
" frankly"          -0.08
" reasonable"       -0.08
" obstruction"      -0.08
" nons"             -0.08
"iswa"              -0.08
"nila"              -0.08
" realistically"    -0.08
"Kir"               -0.08

Positive Logits

Token               Logit
" authentic"        0.59
" genuine"          0.48
" Authentic"        0.47
"auth"              0.45
" authenticity"     0.43
" auth"             0.40
"Auth"              0.40
"enuine"            0.40
" Genuine"          0.36
"-auth"             0.36

Activations Density 0.142%
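
To give a rough sense of scale for the density figure, the sketch below converts it into an approximate token count. It rests on two assumptions that this page does not state: that the density is the fraction of token positions in the dashboard sample on which the feature has a nonzero activation, and that every one of the 10,000 prompts contributes exactly 128 tokens. The function name `expected_active_tokens` is hypothetical and used only for illustration.

```python
def expected_active_tokens(num_prompts: int,
                           tokens_per_prompt: int,
                           density_percent: float) -> float:
    """Estimate how many token positions activate the feature.

    Assumes density_percent is the percentage of all token positions
    in the sample with a nonzero activation (an assumption, not a
    definition taken from the dashboard).
    """
    total_tokens = num_prompts * tokens_per_prompt
    return total_tokens * density_percent / 100.0


# Figures from the Configuration and Activations Density entries.
total = 10_000 * 128                      # 1,280,000 token positions
active = expected_active_tokens(10_000, 128, 0.142)
print(total, round(active))
```

Under those assumptions, the feature would fire on roughly 1,800 of the 1.28 million sampled token positions, which fits a feature tied to a narrow concept such as authenticity.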