INDEX 420

Explanations

The neuron responds to words ending in “-friendly” (e.g. “friendly”).

Configuration

Prompts (Dashboard): 24,576 prompts, 128 tokens each

Dataset (Dashboard): monology/pile-uncopyrighted

Negative Logits

Token          Logit
"the"          -2.61
"跸"           -2.16
" aveva"       -2.08
" avait"       -2.00
"釤"           -2.00
" alcune"      -1.92
" molte"       -1.90
" Then"        -1.89
"ations"       -1.88
(blank)        -1.85

Positive Logits

Token          Logit
"沨"           2.44
"One"          2.19
"With"         2.09
"羅"           2.08
"From"         2.06
"Before"       2.03
"Although"     2.00
"otriva"       1.98
" encourages"  1.96
"When"         1.95

Activations Density 0.013%
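
To give a sense of scale for the density figure, here is a minimal sketch of the arithmetic. It assumes that the density is the fraction of tokens on which the neuron has a nonzero activation, and that it was measured over the dashboard prompt set listed under Configuration. Neither assumption is stated on the page, and the function and variable names are illustrative only.

```python
# Figures copied from the dashboard: prompt configuration and density readout.
NUM_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
DENSITY_PERCENT = 0.013


def total_tokens(num_prompts: int, tokens_per_prompt: int) -> int:
    """Total number of tokens across all dashboard prompts."""
    return num_prompts * tokens_per_prompt


def expected_active_tokens(total: int, density_percent: float) -> float:
    """Approximate count of tokens with nonzero activation.

    ASSUMPTION: density is a percentage of tokens in this prompt set.
    """
    return total * density_percent / 100.0


total = total_tokens(NUM_PROMPTS, TOKENS_PER_PROMPT)
active = expected_active_tokens(total, DENSITY_PERCENT)

print(f"total tokens: {total:,}")              # total tokens: 3,145,728
print(f"approx. active tokens: {active:.0f}")  # approx. active tokens: 409
```

Under these assumptions the neuron fires on roughly 409 of about 3.1 million tokens, which is consistent with a sparse, narrowly tuned unit.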