Explanations
referencing previous information
The neuron fires on self-referential cue phrases about the assistant’s prior remarks (e.g. “mentioned in my previous response,” “above”).
Negative Logits
reminders            -0.08
Message              -0.07
unknown              -0.06
NoSuch               -0.06
Teddy                -0.06
ंदर                  -0.06
carc                 -0.06
cancellationToken    -0.06
.rd                  -0.06
button               -0.06
Positive Logits
SCO                  0.07
灣                   0.06
تل                   0.06
Auswahl              0.06
crumbling            0.06
ução                 0.06
Illustrated          0.06
호텔                 0.06
AUX                  0.06
фор                  0.06
Activations Density 0.131%
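
The page does not say how the density figure is computed. A common reading, assumed here, is the fraction of sampled token positions on which the neuron's activation is above zero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of sampled tokens on which the
    neuron is active. The dashboard itself does not define the term.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1000 token positions, exactly one of which is active.
sample = [0.0] * 999 + [2.5]
print(f"{activation_density(sample):.3%}")  # prints 0.100%
```

Under this reading, a density of 0.131% would mean the neuron is active on roughly 13 of every 10,000 sampled tokens, which fits a feature that responds only to narrow cue phrases.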