INDEX

Explanations

causing damage or negative effects

Configuration

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

-2.63

🕗

-2.39

iſter

-2.14

 gouver

-2.08

So

-2.08

ism

-2.05

 kaas

-2.03

pü

-2.00

錘

-1.98

-1.96

Positive Logits

Token           Logit
"ly"            2.69
" così"         2.36
"to"            2.28
"鹇"            2.20
" perceptive"   2.19
" damaging"     2.03
"🧌"            2.00
"鸶"            1.94
"檠"            1.89
"﹐"            1.88

Activations Density 0.016%
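
To put the density figure in context, the sketch below converts it into an approximate token count using the prompt configuration listed above. It assumes that "Activations Density" is the percentage of dashboard tokens on which the feature has a nonzero activation. The page does not state this definition, so treat the result as a rough estimate. The variable names are illustrative only.

```python
# Values copied from the dashboard: 24,576 prompts, 128 tokens each,
# and an activation density of 0.016%.
NUM_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
ACTIVATION_DENSITY_PERCENT = 0.016

# Total number of tokens across all dashboard prompts.
total_tokens = NUM_PROMPTS * TOKENS_PER_PROMPT

# Assumption (not stated on the page): density is the share of those
# tokens on which the feature fires at all.
approx_active_tokens = total_tokens * ACTIVATION_DENSITY_PERCENT / 100

print(f"total tokens: {total_tokens:,}")
print(f"approx. active tokens: {approx_active_tokens:.0f}")
```

Under that assumption the feature fires on roughly five hundred of about 3.1 million tokens, which is consistent with a sparse feature.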