Explanations

safe and ethical boundaries

The neuron fires on positive, evaluative adjectives describing desirable qualities (e.g. safe, fun, natural, soft, smooth, healthy, positive, constructive, cost-effective).

Configuration

Prompts (Dashboard)

392,802 prompts, 256 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

larıyla	0.80
 strany	0.79
 zarówno	0.79
ları	0.74
lsulfanyl	0.69
 folgen	0.67
larından	0.64
 kumb	0.64
्स	0.64
 của	0.64

Positive Logits

acting	0.71
 спосо	0.67
aining	0.66
hythm	0.66
 बुनिया	0.65
بت	0.63
 ジェ	0.63
ｹ	0.63
IMUM	0.62
 クリア	0.62

Activations Density 0.162%