Explanations

No Explanations Found

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each
Dataset (Dashboard): lmsys + oasst1

Negative Logits

 wickedness    0.73
 deceived      0.62
 wicked        0.60
 defying       0.59
 implique      0.59
 wrongful      0.57
卧             0.57
 impossible    0.56
 deception     0.55
 disobey       0.55

Positive Logits

 awkwardly        1.29
 awkward          1.23
 cringe           1.08
 overly           1.02
尷                0.98
 bland            0.95
 cheesy           0.90
 jarring          0.88
 lackluster       0.88
 underwhelming    0.86

Activations Density 0.660%

No Known Activations