INDEX

Explanations

words and phrases indicating non-violence or non-discrimination

Configuration

Prompts (Dashboard): 24,576 prompts, 128 tokens each

Dataset (Dashboard): cerebras/SlimPajama-627B

Negative Logits

Token            Logit
"itis"           -0.08
"ovsky"          -0.08
"owell"          -0.07
"hurst"          -0.07
"kok"            -0.06
"ouser"          -0.06
"acific"         -0.06
"eyin"           -0.06
"raig"           -0.06
"edes"           -0.06

Positive Logits

Token            Logit
" compromising"  0.07
"ucket"          0.06
"(er"            0.06
" surprises"     0.06
"uck"            0.06
" strain"        0.06
" Glam"          0.06
" harmful"       0.06
"agna"           0.06
"-to"            0.06

Activations Density 0.017%
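
To put the density figure in context, the short sketch below estimates how many tokens in the dashboard prompts would carry a nonzero activation. It assumes, without confirmation from this page, that the activation density is the share of all tokens across the 24,576 dashboard prompts on which the feature fires. The function and variable names are illustrative only.

```python
# Figures copied from the dashboard configuration and density line.
NUM_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
ACTIVATION_DENSITY_PERCENT = 0.017


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total number of tokens across all dashboard prompts."""
    return num_prompts * tokens_per_prompt


def estimated_active_tokens(total: int, density_percent: float) -> float:
    """Estimated count of tokens with a nonzero activation.

    ASSUMPTION: density is the percentage of tokens on which the
    feature fires; the page itself does not define the term.
    """
    return total * density_percent / 100.0


total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
active = estimated_active_tokens(total, ACTIVATION_DENSITY_PERCENT)

print(f"{total:,} tokens in total")
print(f"about {round(active)} tokens with a nonzero activation")
```

Under that assumption the feature is very sparse: it would fire on roughly 535 of about 3.1 million tokens.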