Explanations
instructions and explanations
The neuron activates on mid-frequency content words typical of explanatory answer sentences, signaling when detailed, informational language is being used.
Negative Logits
Token       Logit
Could       -0.07
_stride     -0.06
and         -0.06
chimp       -0.06
throw       -0.06
continue    -0.06
above       -0.06
neighbor    -0.06
[i          -0.06
}'",        -0.06
Positive Logits
Token           Logit
shemale         0.07
Lima            0.07
Cls             0.06
_CONFIRM        0.06
.nextElement    0.06
перев           0.06
xaf             0.06
disadv          0.06
说话            0.06
RU              0.06
Activations Density 0.027%