Explanations

This neuron detects terms that signal disallowed or prohibited content in guideline-style text, especially text about listing or revealing personal or identifying information.

Configuration

Prompts (Dashboard): 392,802 prompts, 256 tokens each

Dataset (Dashboard): monology/pile-uncopyrighted

Negative Logits

 whereupon  0.75
 ceas  0.73
 وق  0.68
 परन्तु  0.68
 postice  0.68
 beset  0.65
 всеми  0.64
opportunity  0.64
 удовле  0.63
 andRow  0.63

Positive Logits

比如  0.97
 hairstyles  0.86
 esimerkiksi  0.86
isetas  0.84
比如说  0.83
 gimm  0.82
 adjectives  0.82
 hashtags  0.80
 lyrics  0.79
比如說  0.79

Activations Density 0.017%
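
As a rough illustration of what a density figure like this could mean, the sketch below treats density as the percentage of tokens on which the neuron fires above a threshold. That definition, the `activation_density` helper, and the zero threshold are assumptions made for illustration, not the dashboard's documented method. The token count is simply the prompt count multiplied by the tokens per prompt listed under Configuration.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of activations strictly above `threshold`.

    Hypothetical helper: the dashboard's exact definition is not stated
    in the page, so this is only one plausible reading of "density".
    """
    total = len(activations)
    if total == 0:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / total


# Corpus size implied by the configuration: prompts x tokens per prompt.
total_tokens = 392_802 * 256

# Approximate number of tokens that would be active at a 0.017% density,
# assuming the density is measured over this whole corpus.
approx_active_tokens = round(total_tokens * 0.017 / 100)

print(total_tokens, approx_active_tokens)
```

Under these assumptions the neuron would fire on only a small fraction of the roughly 100 million tokens, which is consistent with a sparse, narrowly selective unit.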