Explanations

thinking about self-reflection

Configuration

Prompts (Dashboard): 10,000 prompts, 128 tokens each

Dataset (Dashboard): lmsys/lmsys-chat-1m

Negative Logits

Token            Logit
"ìĿĳ"            -0.09
" sympath"       -0.09
"dra"            -0.09
"antal"          -0.09
"å¿ľ"            -0.09
"enc"            -0.08
" uncon"         -0.08
" approaches"    -0.08
"opian"          -0.08
" Braun"         -0.08
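Two of the negative-logit tokens (ìĿĳ and å¿ľ) look like raw byte-level BPE vocabulary strings rather than readable text. The sketch below shows one way such strings could be turned back into characters. It assumes the model uses the GPT-2 style byte-to-unicode table, which the page does not confirm. The helper names `gpt2_byte_decoder` and `decode_token` are hypothetical and are not part of any dashboard API.

```python
def gpt2_byte_decoder():
    """Build the display-character -> byte table used by GPT-2 style byte-level BPE.

    Assumption: the tokenizer behind this dashboard follows the GPT-2 scheme.
    """
    # Bytes that are shown as themselves (printable, non-space characters).
    keep = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    table = {chr(b): b for b in keep}
    # Every remaining byte is shown as a code point starting at U+0100.
    n = 0
    for b in range(256):
        if b not in keep:
            table[chr(256 + n)] = b
            n += 1
    return table


def decode_token(display: str) -> str:
    """Convert a vocabulary display string back to the text it encodes."""
    table = gpt2_byte_decoder()
    raw = bytes(table[ch] for ch in display)
    # Tokens can be partial UTF-8 sequences, so replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for shown in ["ìĿĳ", "å¿ľ"]:
        text = decode_token(shown)
        print(repr(shown), "->", repr(text), [hex(ord(c)) for c in text])
```

Under that assumption, the first string decodes to the single character U+C751 (a Hangul syllable) and the second to U+5FDC (a CJK ideograph). If the model uses a different tokenizer, the mapping does not apply and the strings should be left as shown.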

Positive Logits

Token            Logit
" intros"        0.36
" self"          0.35
" Self"          0.31
"intros"         0.29
"Self"           0.27
" reflection"    0.26
" reflect"       0.26
" SELF"          0.25
"(self"          0.24
"self"           0.24

Activations Density 0.131%
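To give the density figure a sense of scale, the short calculation below combines it with the dashboard configuration (10,000 prompts of 128 tokens each). It assumes the density is the fraction of tokens in that prompt set on which the feature has a nonzero activation. The page does not state how the figure is computed, so treat the result as a rough estimate. The function name `expected_active_tokens` is illustrative only.

```python
def expected_active_tokens(num_prompts: int, tokens_per_prompt: int, density_percent: float):
    """Return (total tokens, estimated tokens on which the feature fires).

    Assumption: density_percent is measured over exactly these tokens.
    """
    total = num_prompts * tokens_per_prompt
    active = total * density_percent / 100.0
    return total, active


if __name__ == "__main__":
    total, active = expected_active_tokens(10_000, 128, 0.131)
    print(f"total tokens: {total:,}")
    print(f"estimated active tokens: {active:,.0f}")
```

With these inputs the total is 1,280,000 tokens, and 0.131% of that is roughly 1,677 tokens, or about one activation per 760 tokens.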