INDEX
Explanations
AI self-description using "but" and "I"
Sentences or passages where the assistant introduces itself or describes its identity, training, capabilities, and availability.
The neuron flags the assistant's own long-form explanation turns (multi-paragraph, bullet-list responses) rather than user utterances. In other words, it activates on tokens in the model's detailed breakdowns.
Negative Logits
我們要              0.51
!!!!!!!!!!!!!!!!    0.49
Fuck                0.47
vimos               0.46
нам                 0.46
fucking             0.45
!!!!!!!!            0.45
consiglio           0.45
imo                 0.44
veamos              0.44
Positive Logits
Developers          0.63
Capabilities        0.63
Unlike              0.60
OpenAI              0.59
Developers          0.59
myself              0.58
developers          0.58
Capabilities        0.57
capabilities        0.56
Unlike              0.56
Activations Density 0.462%
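
An activation density figure like the one above is commonly read as the share of sampled tokens on which the neuron fires. The following is a minimal Python sketch under that assumption; the function name `activation_density`, the zero threshold, and the sample values are hypothetical and not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of tokens with a nonzero
    (above-threshold) activation, expressed as a percentage.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical per-token activations: 5 active tokens out of 1000.
sample = [0.0] * 995 + [0.8, 1.2, 0.3, 2.1, 0.5]
print(f"{activation_density(sample):.3f}%")
```

Under this reading, a density of 0.462% would mean the neuron is active on roughly 46 of every 10,000 sampled tokens.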