Explanations

instances of insults and derogatory language

Configuration

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

cerebras/SlimPajama-627B

Negative Logits

Token     Logit
orr       -0.07
ilden     -0.07
yles      -0.07
ales      -0.07
gie       -0.07
erness    -0.07
stral     -0.07
ills      -0.07
elp       -0.07
over      -0.07

Positive Logits

Token     Logit
ively     0.09
ingly     0.09
ably      0.08
uous      0.08
atory     0.08
antly     0.07
ive       0.07
urb       0.06
acios     0.06

Activations Density 0.004%
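
As a rough sanity check, the prompt configuration and the activation density can be combined to estimate how many tokens the feature fires on. The sketch below assumes the 0.004% density was measured over the same 24,576 prompts of 128 tokens each, which the page does not state explicitly; the function and constant names are illustrative only.

```python
# Figures copied from the dashboard configuration and density readout.
PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
DENSITY_PERCENT = 0.004  # reported value, presumably rounded


def total_tokens(prompts: int, tokens_per_prompt: int) -> int:
    """Total number of token positions in the prompt set."""
    return prompts * tokens_per_prompt


def estimated_active_tokens(total: int, density_percent: float) -> float:
    """Estimated token positions with a nonzero activation.

    Assumes density is the percentage of token positions on which
    the feature activates (an assumption, not stated on the page).
    """
    return total * density_percent / 100.0


total = total_tokens(PROMPTS, TOKENS_PER_PROMPT)
active = estimated_active_tokens(total, DENSITY_PERCENT)
print(f"total tokens: {total}")
print(f"estimated active tokens: {round(active)}")
```

Under these assumptions the feature fires on roughly 126 of 3,145,728 token positions. Because the reported density is itself rounded, this is an order-of-magnitude estimate rather than an exact count.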