Explanations

can also lead/perpetuate/be

Configuration

Prompts (Dashboard)

10,000 prompts, 128 tokens each

Dataset (Dashboard)

lmsys/lmsys-chat-1m

Negative Logits

icz

-0.13

 harming

-0.10

igo

-0.10

eyJ

-0.10

pha

-0.10

 rhet

-0.09

ellation

-0.08

_phr

-0.08

різ

-0.08

Positive Logits

 contr

0.16

 perpet

0.12

er

0.11

ripple

0.11

contr

0.11

 create

0.10

ours

0.10

 overall

0.10

 also

0.10

 strain

0.10

Activations Density 0.060%
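
To put the density figure in concrete terms, the short sketch below converts it into an approximate token count. It assumes, which the page does not confirm, that the 0.060% density was measured over the same 10,000 prompts of 128 tokens each listed under Configuration. The function name `expected_active_tokens` is hypothetical and used only for illustration.

```python
def expected_active_tokens(n_prompts: int, tokens_per_prompt: int,
                           density_percent: float) -> tuple[int, float]:
    """Return (total tokens, approximate number of tokens where the feature fires).

    Assumption: the density is the fraction of all dashboard tokens with a
    nonzero activation, expressed as a percentage.
    """
    total_tokens = n_prompts * tokens_per_prompt
    active_tokens = total_tokens * density_percent / 100.0
    return total_tokens, active_tokens


# Values taken from the Configuration and Activations Density entries above.
total, active = expected_active_tokens(10_000, 128, 0.060)
print(f"{total:,} tokens in total, roughly {round(active)} with a nonzero activation")
```

Under that assumption the feature would fire on only a few hundred of about 1.28 million tokens, which fits a narrow feature tied to phrasing such as "can also lead".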