Explanations

Milgram obedience experiments

Configuration

Prompts (Dashboard): 24,576 prompts, 128 tokens each

Dataset (Dashboard): monology/pile-uncopyrighted

Negative Logits

"Werbung"  -0.81
" kwamen"  -0.77
" WRITER"  -0.75
"isele"  -0.75
" Forster"  -0.75
" pintores"  -0.75
"フライ"  -0.71
"रीदारी"  -0.70
"ータス"  -0.70
" ул"  -0.69

Positive Logits

"diffusion"  1.13
" obedience"  1.08
" Obedience"  1.01
"Mil"  0.96
"Mil"  0.94
" experimenter"  0.93
" Compliance"  0.91
"obedience"  0.90
" compliance"  0.89
"Compliance"  0.89

Activations Density 0.016%
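
To give the density figure a sense of scale, the short sketch below converts it into an approximate token count. It assumes (the page does not say so) that the density is the fraction of tokens in the dashboard prompt set, 24,576 prompts of 128 tokens each, on which the feature activates. The variable names are illustrative only.

```python
# Assumption: density = fraction of dashboard tokens with a nonzero activation.
n_prompts = 24_576          # dashboard prompts, from the configuration above
tokens_per_prompt = 128     # tokens per prompt, from the configuration above
density = 0.016 / 100       # 0.016% expressed as a fraction

total_tokens = n_prompts * tokens_per_prompt
approx_active = total_tokens * density

print(f"total tokens: {total_tokens:,}")
print(f"approx. activating tokens: {approx_active:.0f}")
```

Under that assumption the feature fires on roughly 500 of about 3.1 million tokens, which fits a narrow, topic-specific feature.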