INDEX

Explanations

negative qualifiers and self-descriptions concerning personal identity and involvement

Configuration

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

cerebras/SlimPajama-627B

Negative Logits

Token         Logit
"itial"       -0.08
"ceph"        -0.07
"ettel"       -0.07
"NL"          -0.07
" cuckold"    -0.07
"eper"        -0.07
"Ãłm"         -0.07
"æķ¢"         -0.07
"foon"        -0.07
"acie"        -0.07

Positive Logits

Token             Logit
"nor"             0.10
" experts"        0.07
" expert"         0.07
"robot"           0.07
" rocket"         0.06
" Anch"           0.06
"rys"             0.06
" anch"           0.06
"Nor"             0.06
" experienced"    0.06

Activations Density 0.013%
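
To put the density figure in perspective, the short Python sketch below estimates roughly how many tokens of the prompt set the feature fires on. It assumes that the activation density is the fraction of all dashboard prompt tokens with a nonzero activation; the page does not state this definition, and the displayed 0.013% is rounded, so the result is only an approximation. The variable names are illustrative.

```python
# Values copied from the dashboard configuration.
NUM_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
ACTIVATION_DENSITY_PERCENT = 0.013  # as displayed on the page (rounded)

# Total number of tokens across all dashboard prompts.
total_tokens = NUM_PROMPTS * TOKENS_PER_PROMPT

# Assumption: density = (tokens with nonzero activation) / (total tokens).
approx_active_tokens = round(total_tokens * ACTIVATION_DENSITY_PERCENT / 100)

print(f"total tokens: {total_tokens:,}")
print(f"approx. activating tokens: {approx_active_tokens:,}")
```

Under that assumption the feature is active on only a few hundred of the roughly three million tokens, which fits a sparse feature with small logit effects.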