Explanations

censorship and interference

Configuration

Prompts (Dashboard): 16,384 prompts, 128 tokens each

Dataset (Dashboard): monology/pile-uncopyrighted

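Negative Logits

Logit   Token
-0.10   "è¿" (partial UTF-8 sequence, bytes E8 BF)
-0.09   "cole"
-0.09   "晏" (Chinese: late; also a surname)
-0.09   "立足" (Chinese: to gain a foothold)
-0.09   "nodoc"
-0.08   ".twitter"
-0.08   "松弛" (Chinese: slack, relaxed)
-0.08   "仁" (Chinese: benevolence)
-0.08   "灿" (Chinese: brilliant, radiant)
-0.08   "กรรม" (Thai: karma, deed)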

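Positive Logits

Logit   Token
0.18    " censorship"
0.17    " censor"
0.14    "钳" (Chinese: pincers; to clamp)
0.13    "压制" (Chinese: to suppress)
0.12    "噤" (Chinese: to fall silent)
0.11    "禁区" (Chinese: forbidden zone)
0.11    "压抑" (Chinese: to repress)
0.10    " stif"
0.10    "束缚" (Chinese: to bind, restrain)
0.10    "削" (Chinese: to pare, cut down)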
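
Byte-level BPE vocabularies store non-ASCII tokens in a printable surface form, so a token such as 压制 can appear as `åİĭåĪ¶` in a raw export. The sketch below shows one way to map such strings back to text. It assumes the tokenizer uses the GPT-2-style byte-to-character table, which this page does not state. The helper names `bytes_to_unicode` and `decode_token` are illustrative, not part of any dashboard API.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Byte -> printable character table (GPT-2-style byte-level BPE, assumed)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    shift = 0
    # Every remaining byte is assigned the next code point starting at U+0100.
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(surface: str) -> str:
    """Turn a byte-level surface string back into UTF-8 text.

    Tokens that end in the middle of a multi-byte character decode to
    U+FFFD, because they do not form complete UTF-8 on their own.
    """
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in surface)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for surface in ["åİĭåĪ¶", "ç¦ģåĮº", "è¿"]:
        print(repr(surface), "->", repr(decode_token(surface)))
```

A token like `è¿` covers only the first two bytes of a three-byte character, so it has no standalone decoding and is best left in its surface form.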

Activations Density 0.093%
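
For a rough sense of scale, the density can be converted into a token count. This sketch assumes the density is the fraction of tokens on which the feature is active and that it was measured over the dashboard prompts listed under Configuration (16,384 prompts of 128 tokens each). The page confirms neither point, so treat the result as an estimate only.

```python
# Values copied from the dashboard; the interpretation is an assumption.
N_PROMPTS = 16_384
TOKENS_PER_PROMPT = 128
DENSITY_PERCENT = 0.093

total_tokens = N_PROMPTS * TOKENS_PER_PROMPT
expected_active = total_tokens * DENSITY_PERCENT / 100

print(f"total tokens:       {total_tokens:,}")
print(f"approx. active:     {expected_active:,.0f}")
```

Under these assumptions the feature would fire on roughly two thousand of about 2.1 million tokens.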