Explanations

phrases related to prioritizing values over personal interests

Configuration

Prompts (Dashboard): 24,576 prompts, 128 tokens each

Dataset (Dashboard): cerebras/SlimPajama-627B

Negative Logits

Token      Logit
"ta"       -0.07
"imli"     -0.07
"ohn"      -0.07
"enc"      -0.07
"enco"     -0.07
"vid"      -0.07
(blank)    -0.06
"iba"      -0.06
"VID"      -0.06
"gest"     -0.06

Positive Logits

Token            Logit
" priority"      0.09
" priorities"    0.09
" prioritize"    0.07
" safety"        0.07
".priority"      0.07
"πού"            0.07
"afety"          0.07
"priority"       0.06
" concerns"      0.06
" interests"     0.06

Activations Density 0.096%
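
To give the density figure a sense of scale, the short sketch below converts it into an approximate count of activating tokens. It rests on an assumption the page does not state: that the density is the share of token positions in the dashboard prompt set (24,576 prompts of 128 tokens each) on which the feature has a nonzero activation. The function and constant names are illustrative only.

```python
# Assumption (not stated on the page): activation density is the share of
# token positions in the dashboard prompt set where the feature is nonzero.
NUM_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
ACTIVATION_DENSITY_PERCENT = 0.096


def estimated_active_tokens(num_prompts, tokens_per_prompt, density_percent):
    """Return (total token positions, estimated activating positions)."""
    total = num_prompts * tokens_per_prompt
    return total, total * density_percent / 100.0


total_tokens, active_tokens = estimated_active_tokens(
    NUM_PROMPTS, TOKENS_PER_PROMPT, ACTIVATION_DENSITY_PERCENT
)

print(f"total tokens: {total_tokens:,}")                  # total tokens: 3,145,728
print(f"estimated active tokens: {active_tokens:,.0f}")   # estimated active tokens: 3,020
```

Under that assumption the feature fires on roughly 3,000 of about 3.1 million token positions, which is consistent with a sparse feature tied to a narrow set of priority-related tokens.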