INDEX

Explanations

emotions

The neuron detects system, role, and metadata markers as well as instruction-like tokens, i.e., blocks of system or assistant prompt text and formatting.


Negative Logits

 Token        Logit
 originally   -0.12
 Usually      -0.11
 Using        -0.11
 Typically    -0.10
 Actually     -0.10
 Redwood      -0.10
 Exactly      -0.10
 used         -0.10
 fifteen      -0.10
 Could        -0.10

Positive Logits

 Token        Logit
 emotions     0.22
 emotional    0.22
 anxiety      0.22
 emotion      0.20
 insecurity   0.20
 resentment   0.20
 discomfort   0.20
 empathy      0.19
 loneliness   0.19
 sadness      0.19
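Lists like the two above can be produced by ranking tokens by their logit value and keeping the extremes at each end. The sketch below assumes a token-to-logit mapping is already available; the page does not say how the values were computed, and the function name `top_logit_tokens` and the sample dictionary are illustrative only (the sample values are copied from the lists above).

```python
def top_logit_tokens(logit_effects, k=10):
    """Return the k most negative and k most positive (token, logit) pairs.

    logit_effects: dict mapping token string -> logit value (hypothetical input).
    """
    items = list(logit_effects.items())
    # Most negative first.
    negative = sorted(items, key=lambda kv: kv[1])[:k]
    # Most positive first.
    positive = sorted(items, key=lambda kv: kv[1], reverse=True)[:k]
    return negative, positive


# A few values taken from the lists above, for illustration.
sample = {
    " emotions": 0.22,
    " sadness": 0.19,
    " originally": -0.12,
    " Usually": -0.11,
    " Could": -0.10,
}

negative, positive = top_logit_tokens(sample, k=2)
```

With `k=2`, `negative` holds the two lowest-valued tokens and `positive` the two highest-valued ones.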

Activations Density 0.141%
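One common reading of an activation density figure is the fraction of sampled tokens on which the neuron fires at all. The sketch below assumes that definition (activation strictly above a threshold of zero); the page does not state the definition it uses, and `activation_density` is a hypothetical helper, not part of any dashboard API.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of activations strictly above `threshold`.

    Assumes density means "share of tokens where the neuron is active";
    this definition is an assumption, not taken from the page.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic example: 141 active tokens out of 100,000 sampled tokens.
synthetic = [0.0] * 99_859 + [1.0] * 141
density = activation_density(synthetic)
density_percent = f"{density:.3%}"
```

Under this assumption, a density near 0.141% would mean the neuron is active on roughly 1.4 tokens per thousand.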