INDEX

Explanations

The core pattern across these lists seems to be that the neuron activates on the word "even" when it introduces a concession or a specific framing for a refusal, particularly in the context of AI safety guidelines. The texts often contain phrases like "even one involving..." or "even framed within...". The `MAX_ACTIVATING_TOKENS` list is filled with "even". The `TOKENS_AFTER_MAX_ACTIVATING_TOKEN` list has words like "one", "in", "framed", "with", "involving", and "between". These often follow "even" to specify a condition or scenario to which the preceding context (usually a refusal) still applies.

even followed by conditions
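The pattern described above can be approximated with a simple text heuristic. The following is a minimal sketch, not the neuron's actual computation: it only flags "even" when it is directly followed by one of the follower words listed in the explanation. The function name `find_even_conditions` and the example sentence are hypothetical and chosen for illustration.

```python
import re

# Follower words copied from the TOKENS_AFTER_MAX_ACTIVATING_TOKEN list above.
FOLLOWERS = ("one", "in", "framed", "with", "involving", "between")

# Assumed heuristic (not the neuron itself): "even" plus one follower word.
PATTERN = re.compile(
    r"\beven\s+(" + "|".join(FOLLOWERS) + r")\b", re.IGNORECASE
)


def find_even_conditions(text):
    """Return (character offset of "even", follower word) for each match."""
    return [(m.start(), m.group(1).lower()) for m in PATTERN.finditer(text)]


# Made-up refusal-style sentence, for illustration only.
example = "I can't write that story, even framed within a fictional setting."
print(find_even_conditions(example))
```

A string match like this will miss activations on other followers and will flag uses of "even" that the neuron may ignore, so it is best read as a rough way to sample candidate contexts rather than as a description of the feature.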

Configuration

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

Token            Logit
"вайтесь"        0.37
" Polymer"       0.37
" STATEMENT"     0.37
" לג"            0.36
" HART"          0.35
" polymers"      0.35
"oni"            0.35
"\,\"            0.35
"同學們"          0.35
" justamente"    0.35

Positive Logits

Token            Logit
" Leng"          0.38
"picture"        0.36
"aría"           0.36
" caractér"      0.36
"leng"           0.36
"gave"           0.36
"Serie"          0.36
" none"          0.35
"linux"          0.35
" nowhere"       0.35

Activations Density 0.005%
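
To give a sense of scale, the density can be combined with the prompt configuration listed above. This is a back-of-the-envelope sketch under two assumptions that the page does not confirm: that the density was measured over this same prompt set, and that every prompt contributes a full 512 tokens.

```python
# Figures taken from the Configuration and Activations Density entries.
prompts = 238_145
tokens_per_prompt = 512
density = 0.005 / 100  # 0.005% expressed as a fraction

# Assumption: every prompt is exactly 512 tokens long.
total_tokens = prompts * tokens_per_prompt

# Assumption: the density applies to this same token set.
expected_active = total_tokens * density

print(total_tokens)            # 121930240
print(round(expected_active))  # 6097
```

Under those assumptions the neuron would fire on roughly six thousand tokens out of about 122 million, which is consistent with a narrow feature tied to a single word in a specific context.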