Explanations

manipulative techniques

Configuration

Prompts (Dashboard)

10,000 prompts, 128 tokens each

Dataset (Dashboard)

lmsys/lmsys-chat-1m

Negative Logits

Token            Logit
" discretion"    -0.10
"ikip"           -0.09
"Coc"            -0.09
" Discrim"       -0.09
" counterfeit"   -0.08
"olen"           -0.08
"æĨ"             -0.08
" Foster"        -0.08
"tim"            -0.08
" ritual"        -0.08

Positive Logits

Token             Logit
" techniques"     0.24
"Manip"           0.23
" manip"          0.23
" Manip"          0.23
" manipulation"   0.21
" Techniques"     0.20
" tactics"        0.19
" technique"      0.19
" manipulate"     0.19
"æĵį"             0.17
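
Two of the listed tokens (æĨ and æĵį) are not plain words. They look like the display form that byte-level BPE tokenizers use for raw UTF-8 bytes. The sketch below assumes the GPT-2 style byte-to-character table, which this page does not confirm; the helper names are hypothetical. Under that assumption, æĵį decodes to a complete CJK character, while æĨ is only a two-byte fragment of one.

```python
def byte_to_display_table():
    """Build the GPT-2 style byte -> display-character table (assumed here)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted to code points from 256 upward.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(c) for c in code_points)))


# Inverse table: display character -> original byte value.
DISPLAY_TO_BYTE = {ch: b for b, ch in byte_to_display_table().items()}


def display_token_to_bytes(token):
    """Map a displayed token back to the raw bytes it stands for."""
    out = bytearray()
    for ch in token:
        if ch in DISPLAY_TO_BYTE:
            out.append(DISPLAY_TO_BYTE[ch])
        else:
            # Fallback for characters shown literally, such as a plain space.
            out.extend(ch.encode("utf-8"))
    return bytes(out)


def decode_display_token(token):
    """Decode to text; incomplete byte sequences become U+FFFD."""
    return display_token_to_bytes(token).decode("utf-8", errors="replace")


print(display_token_to_bytes("æĵį"))  # three bytes forming one character
print(decode_display_token("æĵį"))
print(display_token_to_bytes("æĨ"))   # two bytes, an incomplete sequence
```

If the assumption holds, the positive-logit token æĵį is 操, the first character of the Chinese word for "manipulate" (操纵), which fits the rest of the list.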

Activations Density 0.151%
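
To give a sense of scale, the density can be turned into a rough token count. This is a back-of-the-envelope sketch that assumes the density is the fraction of tokens on which the feature activates and that it was measured over the dashboard prompt set listed under Configuration; neither assumption is stated on this page.

```python
# Values taken from this page.
NUM_PROMPTS = 10_000
TOKENS_PER_PROMPT = 128
DENSITY_PERCENT = 0.151

# Assumption: density = share of all tokens with a nonzero activation.
total_tokens = NUM_PROMPTS * TOKENS_PER_PROMPT
estimated_active_tokens = round(total_tokens * DENSITY_PERCENT / 100)

print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {estimated_active_tokens:,}")
```

Under those assumptions the feature would fire on roughly 1,900 of 1.28 million tokens.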