Explanations

morally objectionable or hateful speech

Configuration

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

 åren

-1.08

tió

-1.05

 possibilité

-1.02

тельства

-1.00

 SITY

-0.98

 možnosti

-0.97

㍑

-0.96

ateľ

-0.95

 illusions

-0.94

 multicolore

-0.94

Positive Logits

2.52

an

1.77

 akin

1.57

 equivalent

1.21

是一种

1.20

 putting

1.15

การ

1.13

 evidence

1.09

 taking

1.02

 called

1.01

Activations Density 0.055%
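
To put the density figure in context, the sketch below estimates how many tokens in the dashboard prompts carry a nonzero activation. It assumes that "Activations Density" means the fraction of token positions on which the feature fires, and that it was measured over the same 24,576 prompts of 128 tokens listed under Configuration; neither assumption is stated on the page. The function names are illustrative only.

```python
# Figures copied from the dashboard configuration and density lines.
NUM_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
ACTIVATION_DENSITY_PERCENT = 0.055


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total number of token positions across all dashboard prompts."""
    return num_prompts * tokens_per_prompt


def expected_active_tokens(total: int, density_percent: float) -> float:
    """Approximate count of token positions with a nonzero activation.

    Assumption: density is the percentage of token positions on which
    the feature activates at all.
    """
    return total * density_percent / 100.0


total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
active = expected_active_tokens(total, ACTIVATION_DENSITY_PERCENT)

print(f"{total:,} tokens in total")
print(f"about {active:,.0f} tokens with a nonzero activation")
```

Under these assumptions the feature fires on roughly 1,730 of 3,145,728 tokens, which is why it reads as a sparse, specific feature rather than a general one.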