Explanations

phrases emphasizing honesty and self-reflection

Configuration

Prompts (Dashboard): 24,576 prompts, 128 tokens each

Dataset (Dashboard): cerebras/SlimPajama-627B

Negative Logits

Token                 Logit
"ceed"                -0.07
"thon"                -0.07
" Chapman"            -0.07
"onso"                -0.07
"nat"                 -0.06
"esel"                -0.06
"(DialogInterface"    -0.06
"iece"                -0.06
"ataires"             -0.06
"isset"               -0.06

Positive Logits

Token                 Logit
"_tac"                0.08
                      0.07
"ÙģØ§ÙĤ"              0.07
                      0.06
"bakan"               0.06
"ãĥ©ãĤ¯"              0.06
"igmoid"              0.06
"Mutable"             0.06
"enance"              0.06
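
Two of the positive-logit tokens (ÙģØ§ÙĤ and ãĥ©ãĤ¯) look garbled. They are consistent with how byte-level BPE vocabularies display multi-byte UTF-8 text. The page does not name the tokenizer, so the following is only a sketch, assuming the GPT-2 style byte-to-character table. The helper names `bytes_to_unicode` and `decode_token` are illustrative and not taken from the page.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Byte -> printable-character table (assumed: GPT-2 style byte-level BPE)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in table:
            # Non-printable bytes are shifted to code points 256 and above.
            table[b] = chr(256 + shift)
            shift += 1
    return table


CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Map a vocabulary-style token string back to the text it encodes."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


for shown in ["ÙģØ§ÙĤ", "ãĥ©ãĤ¯", "igmoid"]:
    print(repr(shown), "->", repr(decode_token(shown)))
```

Under that assumption, the two strings decode to the Arabic فاق and the Japanese ラク, while plain ASCII tokens such as igmoid are unchanged. Tokens the page already shows with a literal leading space (for example " Chapman") are not in vocabulary form and would need the space mapped back first.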

Activations Density 0.002%
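
To give a sense of scale, the density can be combined with the prompt configuration listed above. This is a rough sketch that assumes the density is the fraction of token positions with a nonzero activation, and that it was measured on the dashboard prompts. The page states neither, and the 0.002% figure is rounded.

```python
# Figures from the Configuration section of the page.
n_prompts = 24_576
tokens_per_prompt = 128

# Assumption: density = fraction of token positions where the feature fires.
density = 0.002 / 100  # 0.002 % expressed as a fraction

total_tokens = n_prompts * tokens_per_prompt
expected_active = total_tokens * density

print(f"token positions: {total_tokens:,}")
print(f"approx. active positions: {expected_active:.0f}")
```

With these numbers the feature would fire on roughly 63 of about 3.1 million token positions, so it is a very sparse feature.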