Explanations

harming/threatening

Configuration

Prompts (Dashboard): 16,384 prompts, 128 tokens each

Dataset (Dashboard): monology/pile-uncopyrighted

Negative Logits

Token             Logit
"threatening"     -0.87
" threatening"    -0.83
" endangering"    -0.82
" harmful"        -0.77
" damaging"       -0.74
" endanger"       -0.69
" harming"        -0.69
" hazard"         -0.66
" colorir"        -0.65
" dañ"            -0.63

Positive Logits

Token             Logit
"stateProvider"   0.63
"OrNil"           0.59
"jspx"            0.59
"nsic"            0.58
"the"             0.56
" préc"           0.56
"Ⓒ"               0.55
"Portale"         0.55
"ri"              0.53
"RegistryLite"    0.53

Activations Density 0.170%
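
To give a rough sense of scale, the density figure can be combined with the prompt configuration listed above. This is only a sketch: it assumes that "Activations Density" means the fraction of dashboard tokens on which the feature fires, which the page itself does not state, and the variable names are illustrative.

```python
# Values copied from the dashboard configuration and density line above.
NUM_PROMPTS = 16_384
TOKENS_PER_PROMPT = 128
ACTIVATION_DENSITY = 0.170 / 100  # 0.170% expressed as a fraction

# Total number of tokens in the dashboard prompt set.
total_tokens = NUM_PROMPTS * TOKENS_PER_PROMPT

# ASSUMPTION: density = (tokens with nonzero activation) / (total tokens).
approx_active_tokens = round(total_tokens * ACTIVATION_DENSITY)

print(f"total tokens: {total_tokens:,}")
print(f"approx. active tokens: {approx_active_tokens:,}")
```

Under that assumption, the feature would fire on a few thousand of the roughly two million tokens in the dashboard set.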