Explanations

actions related to deception or manipulation in medical contexts

Configuration

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

cerebras/SlimPajama-627B

Negative Logits

Token               Logit
"º"                 -0.08
"anzi"              -0.07
"_MAJOR"            -0.07
"Äł"                -0.07
"allis"             -0.07
" pedest"           -0.07
".scalablytyped"    -0.07
" instincts"        -0.07
" отли"             -0.07
"amble"             -0.06
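
Some token strings in logit lists like the one above, for example `Äł`, look like the printable stand-ins used by GPT-2-style byte-level BPE vocabularies rather than decoded text. The following is a minimal sketch for decoding such strings. It assumes, without confirmation from this page, that the model's tokenizer uses the GPT-2 byte-to-character table; `decode_bpe_token` is a hypothetical helper name.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style byte -> printable-character table."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are mapped to code points 256, 257, ...
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: stand-in character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_bpe_token(token: str) -> str:
    """Decode a vocabulary string to text (assumes GPT-2 byte-level BPE)."""
    raw = bytearray()
    for ch in token:
        if ch in CHAR_TO_BYTE:
            raw.append(CHAR_TO_BYTE[ch])
        else:
            # Characters outside the table (e.g. a literal space) pass through.
            raw.extend(ch.encode("utf-8"))
    return raw.decode("utf-8", errors="replace")


print(repr(decode_bpe_token("Äł")))
```

If the assumption does not hold for this model's tokenizer, the decoded strings are not meaningful and the raw vocabulary strings should be kept as shown.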

Positive Logits

Token               Logit
" multiple"         0.07
"encers"            0.07
" themselves"       0.07
" just"             0.06
" Strat"            0.06
" simply"           0.06
" often"            0.06
"or"                0.06
" strategically"    0.06
" cheap"            0.06

Activations Density 0.043%
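
To put the density figure in concrete terms, here is a rough arithmetic sketch. It assumes (this is an interpretation, not something the page states) that activation density is the fraction of token positions in the dashboard prompts on which the feature has a nonzero activation. The function name `expected_active_tokens` is hypothetical.

```python
NUM_PROMPTS = 24_576        # from the Prompts (Dashboard) entry
TOKENS_PER_PROMPT = 128     # from the Prompts (Dashboard) entry
DENSITY_PERCENT = 0.043     # from the Activations Density entry


def expected_active_tokens(num_prompts, tokens_per_prompt, density_percent):
    """Return (total token positions, approximate number of active positions).

    Assumes density is measured per token position over the dashboard prompts.
    """
    total = num_prompts * tokens_per_prompt
    active = total * density_percent / 100.0
    return total, active


total, active = expected_active_tokens(
    NUM_PROMPTS, TOKENS_PER_PROMPT, DENSITY_PERCENT
)
print(f"total token positions: {total:,}")
print(f"approx. active positions: {round(active):,}")
```

Under that assumption the feature fires on roughly 1,353 of 3,145,728 token positions, which is consistent with a sparse, narrowly scoped feature.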