Explanations

illegal or harmful content

words describing prohibited content types and policy violations on online platforms.

Configuration

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

Token             Logit
" vinci"          -1.24
" marta"          -1.20
" satel"          -1.16
"はこんな感じ"    -1.13
" wanda"          -1.13
" philippe"       -1.13
"を知る"          -1.12
" dorado"         -1.10
" paulo"          -1.09
" marmor"         -1.09

Positive Logits

Token                Logit
"or"                 1.46
" versátil"          1.45
" Bardzo"            1.43
" içeri"             1.41
" görüntüsü"         1.30
" delitos"           1.27
" Сергей"            1.16
"либо"               1.15
" content"           1.15
" extremadamente"    1.14

Activations Density: 0.036%
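
To give a sense of scale for the configuration and density figures, here is a minimal sketch of the arithmetic. It assumes, without confirmation from the dashboard itself, that the activation density is measured over the same 24,576 prompts of 128 tokens listed under Configuration. The function names are hypothetical and exist only for this illustration.

```python
# Values copied from the dashboard. Treating the density as a fraction of
# the dashboard prompt tokens is an ASSUMPTION, not something the page states.
NUM_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
ACTIVATION_DENSITY_PERCENT = 0.036


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total number of tokens across all dashboard prompts."""
    return num_prompts * tokens_per_prompt


def estimated_active_tokens(total: int, density_percent: float) -> float:
    """Rough count of tokens on which the feature fires, given a density in percent."""
    return total * density_percent / 100.0


if __name__ == "__main__":
    total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
    active = estimated_active_tokens(total, ACTIVATION_DENSITY_PERCENT)
    print(f"Total tokens: {total:,}")
    print(f"Estimated active tokens: {active:,.0f}")
```

Under that assumption, the prompts cover 3,145,728 tokens, and a density of 0.036% corresponds to roughly 1,132 tokens on which the feature is active.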