INDEX

Explanations

race, ethnicity, religion, gender

This neuron isn’t picking out any linguistic feature of the words themselves but rather their position in the input: it reliably spikes on tokens that occur around the same absolute sequence index.
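One way to check a positional claim like this is to average the neuron's activation at each token position across many prompts and see whether the activation mass concentrates near a single index. The sketch below is a hypothetical illustration, not the method used to produce the explanation above. It assumes `acts` is a list of per-prompt activation lists of equal length (for example, the 256-token prompts used for this dashboard). `positional_profile` and `peak_position_share` are made-up helper names.

```python
from typing import List, Tuple

# Hypothetical helpers: acts[i][p] is the neuron's activation on token p of prompt i.
# All prompts are assumed to have the same length (e.g. 256 tokens).


def positional_profile(acts: List[List[float]]) -> List[float]:
    """Mean activation at each sequence position, averaged over prompts."""
    n_prompts = len(acts)
    seq_len = len(acts[0])
    return [sum(row[p] for row in acts) / n_prompts for p in range(seq_len)]


def peak_position_share(acts: List[List[float]], window: int = 2) -> Tuple[int, float]:
    """Return the position with the highest mean activation and the fraction of
    all positive activation that falls within +/- `window` positions of it."""
    profile = positional_profile(acts)
    peak = max(range(len(profile)), key=lambda p: profile[p])  # first maximum wins ties
    lo, hi = max(0, peak - window), min(len(profile), peak + window + 1)
    total = sum(max(a, 0.0) for row in acts for a in row)
    near = sum(max(row[p], 0.0) for row in acts for p in range(lo, hi))
    return peak, (near / total if total > 0 else 0.0)


# Toy data: 3 prompts of 10 tokens; the "neuron" fires at position 5,
# with a small spill-over onto position 6 in one prompt.
toy = [[0.0] * 10 for _ in range(3)]
for row in toy:
    row[5] = 1.0
toy[1][6] = 0.5
print(peak_position_share(toy))  # (5, 1.0)
```

A share close to 1.0 for a narrow window across many prompts would be consistent with the positional reading; a flat profile would not.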

Configuration

Prompts (Dashboard)

392,802 prompts, 256 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

"ഡ" 0.95
" accentuated" 0.94
" tiempo" 0.91
" கிஷ" 0.87
"՝" 0.84
" conecta" 0.83
"ﺨ" 0.82
" anhydrous" 0.82
"নের" 0.81
"ﮑ" 0.81

Positive Logits

"ES" 0.79
"ούν" 0.72
" Screenshots" 0.72
"ard" 0.70
"禍" 0.69
"ству" 0.68
" Histogram" 0.68
"ある" 0.67
" Travelling" 0.67
"újt" 0.66

Activations Density 0.000%
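The density figure can be sanity-checked with a little arithmetic. The sketch below assumes that density is the percentage of all token positions in the configured dataset (392,802 prompts × 256 tokens) on which the neuron's activation is above zero, and that it is displayed rounded to three decimals. Neither assumption is confirmed by the page, and `activation_density` / `format_density` are hypothetical names.

```python
# Assumptions (not stated on the dashboard): density = active positions / all positions,
# "active" means activation > 0, and the percentage is rounded to 3 decimals.
N_PROMPTS = 392_802
TOKENS_PER_PROMPT = 256
TOTAL_TOKENS = N_PROMPTS * TOKENS_PER_PROMPT  # 100,557,312 token positions


def activation_density(n_active: int, total: int = TOTAL_TOKENS) -> float:
    """Percentage of token positions on which the neuron is active."""
    return 100.0 * n_active / total


def format_density(n_active: int) -> str:
    """Render the density as a percentage with three decimals."""
    return f"{activation_density(n_active):.3f}%"


print(TOTAL_TOKENS)         # 100557312
print(format_density(502))  # 0.000%
print(format_density(503))  # 0.001%
```

Under those assumptions, a displayed 0.000% is consistent with anything from zero up to about 502 activating positions out of roughly 100.6 million.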