INDEX

Explanations

Okay

The neuron fires on the "model" (assistant) speaker token, i.e., the token that marks the start of a model/assistant response.

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

Token (quoted to show leading spaces)    Value
"orum"    0.42
" інформа"    0.42
" нашей"    0.42
"ще"    0.42
" да"    0.42
"нных"    0.42
" paraphr"    0.42
" пита"    0.41
" orifice"    0.41
" горе"    0.41

Positive Logits

Token (quoted to show leading spaces)    Value
"your"    0.46
" secretly"    0.44
" hopelessly"    0.43
" ඔබේ"    0.43
" fascist"    0.42
" இரவு"    0.42
" secret"    0.42
" blazing"    0.42
" YOUR"    0.41
" Your"    0.41

Activations Density 0.057%
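
To give a rough sense of scale for the density figure, the sketch below estimates how many tokens the neuron would fire on. It assumes, without confirmation from the dashboard, that the density was measured over the same prompt set listed under Configuration and that every prompt is exactly 512 tokens long. The function name and constants are illustrative only.

```python
# Figures copied from the dashboard; how they relate to each other is an assumption.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
ACTIVATION_DENSITY = 0.057 / 100  # 0.057% expressed as a fraction


def estimate_active_tokens(num_prompts: int, tokens_per_prompt: int, density: float):
    """Return (total tokens, estimated number of tokens with nonzero activation).

    Assumes every prompt has exactly `tokens_per_prompt` tokens and that
    `density` is the fraction of those tokens on which the neuron fires.
    """
    total = num_prompts * tokens_per_prompt
    return total, round(total * density)


total_tokens, active_tokens = estimate_active_tokens(
    NUM_PROMPTS, TOKENS_PER_PROMPT, ACTIVATION_DENSITY
)
print(f"total tokens:     {total_tokens:,}")
print(f"estimated active: {active_tokens:,}")
```

Under these assumptions the neuron fires on roughly 70 thousand of about 122 million tokens, which is consistent with a feature tied to a single speaker token that appears only a few times per conversation.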