Explanations

wrongness

The neuron specifically activates on occurrences of the word “wrong.”

Configuration

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

Token         Logit
"剮"          -3.13
" عندما"      -2.75
(not shown)   -2.67
" неболь"     -2.61
"ਞ"           -2.58
" самая"      -2.56
"邶"          -2.56
"鲻"          -2.55

Positive Logits

Token         Logit
"翾"          2.72
" Ideally"    2.58
"熜"          2.50
"ػ"           2.50
" Thus"       2.48
"玏"          2.45
" Notably"    2.42
" Currently"  2.42
"$,"          2.41
" Roughly"    2.39
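
The page does not say how these logit lists are produced. A common approach, assumed here, is to project the neuron's output direction through the model's unembedding matrix and rank the vocabulary by the resulting per-token logit. The sketch below uses a made-up four-token vocabulary and toy weights; the function name top_logit_tokens and all values are hypothetical and are not taken from this dashboard.

```python
def top_logit_tokens(direction, unembed, vocab, k=2):
    """Rank vocabulary tokens by the logit a direction contributes.

    direction: list of floats, length d_model (hypothetical neuron output vector)
    unembed:   d_model rows, each a list of vocab_size floats (toy unembedding)
    vocab:     list of token strings, length vocab_size
    """
    # Logit for token j is the dot product of the direction with column j.
    logits = [
        sum(d * row[j] for d, row in zip(direction, unembed))
        for j in range(len(vocab))
    ]
    # Indices sorted from most negative to most positive logit.
    order = sorted(range(len(vocab)), key=lambda j: logits[j])
    negative = [(vocab[j], logits[j]) for j in order[:k]]
    positive = [(vocab[j], logits[j]) for j in reversed(order[-k:])]
    return negative, positive


# Toy data, for illustration only.
vocab = ["a", "b", "c", "d"]
unembed = [
    [1.0, -2.0, 0.5, 3.0],
    [0.0, 1.0, -1.0, 0.5],
]
direction = [1.0, 1.0]

negative, positive = top_logit_tokens(direction, unembed, vocab)
print("negative:", negative)
print("positive:", positive)
```

Under this reading, a token in the positive list is one the neuron pushes the model toward predicting when it fires, and a token in the negative list is one it pushes away from.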

Activations Density 0.010%
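
As a rough sanity check, the density figure can be combined with the prompt configuration listed on this page to estimate how many tokens the neuron fires on. This assumes that density means the fraction of all dashboard tokens with a non-zero activation, which the page does not state explicitly.

```python
# Figures copied from the configuration and density lines of this page.
NUM_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
DENSITY_PERCENT = 0.010

# Total number of tokens the dashboard was computed over.
total_tokens = NUM_PROMPTS * TOKENS_PER_PROMPT

# Assumption: density = (tokens with non-zero activation) / (total tokens).
approx_active_tokens = total_tokens * DENSITY_PERCENT / 100

print(f"total tokens: {total_tokens:,}")
print(f"approx. active tokens: {round(approx_active_tokens)}")
```

Because the displayed density is rounded to three decimal places, the estimate is only approximate.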