Explanations
Excerpts from longer texts
This neuron activates on mentions of the assistant’s “answers,” i.e. references to its own responses in the instruction text.
Negative Logits
Token       Logit
(style      -0.07
hide        -0.07
FileMode    -0.06
StatusCode  -0.06
mü          -0.06
However     -0.06
ерами       -0.06
드          -0.06
/action     -0.06
_legend     -0.06
Positive Logits
Token       Logit
[ii         0.07
eným        0.06
Empleado    0.06
medida      0.06
برابر       0.06
Front       0.06
нулась      0.06
prehensive  0.06
obec        0.06
cov         0.06
Activations Density 0.010%
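
As a rough illustration of what a density figure like the one above might mean, the sketch below assumes that activation density is the fraction of sampled tokens on which the neuron's activation exceeds a threshold (zero here). The dashboard's actual definition is not stated in the source, so the function name `activation_density`, the threshold, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the neuron
    fires at all; the source page does not define the metric.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: 10,000 tokens, exactly one of which activates the neuron.
sample = [0.0] * 9999 + [1.3]
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.010%"
```

Under this reading, a density of 0.010% would correspond to the neuron firing on about one token in every 10,000 sampled.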