Explanations

terms related to preventing negative consequences

Configuration

Prompts (Dashboard): 24,576 prompts, 128 tokens each

Dataset (Dashboard): cerebras/SlimPajama-627B

Negative Logits

Token             Logit
"zan"             -0.08
"thon"            -0.08
"atura"           -0.08
"THON"            -0.07
"aban"            -0.07
".radioButton"    -0.07
"ibbon"           -0.07
"RIPTION"         -0.07
"egan"            -0.07
" isize"          -0.07

Positive Logits

Token             Logit
"us"              0.09
"ively"           0.09
"/mit"            0.09
" from"           0.08
"oreach"          0.07
"不了"            0.07
"-motion"         0.07
"/re"             0.06
" others"         0.06
" being"          0.06

Activations Density 0.015%
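
To put the density figure in scale, the sketch below works out how many tokens the dashboard prompts contain and roughly how many of them the feature would fire on. It assumes that the activation density is the fraction of tokens in those 24,576 prompts on which the feature is nonzero. The page does not state this, so treat the result as an estimate. The function names are illustrative and are not part of any dashboard API.

```python
# Figures copied from the dashboard configuration and density readout.
NUM_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
ACTIVATION_DENSITY_PERCENT = 0.015


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total number of tokens across all dashboard prompts."""
    return num_prompts * tokens_per_prompt


def expected_active_tokens(total: int, density_percent: float) -> float:
    """Approximate count of tokens with a nonzero activation.

    Assumption: the density is measured per token over the same prompts.
    """
    return total * density_percent / 100.0


if __name__ == "__main__":
    total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
    active = expected_active_tokens(total, ACTIVATION_DENSITY_PERCENT)
    print(f"total tokens: {total:,}")
    print(f"estimated activating tokens: {active:.0f}")
```

Under that assumption the feature fires on only a few hundred of roughly three million tokens, which fits the small logit magnitudes listed above.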