Explanations
The neuron activates on polite closing or congratulatory words—especially “helpful”—and the accompanying exclamation mark in the assistant’s upbeat wrap‐up sentences.
Negative Logits
وث          -0.07
simult      -0.06
coll        -0.06
WATCH       -0.06
payoff      -0.06
networks    -0.06
_bad        -0.06
Defense     -0.06
west        -0.06
strap       -0.06
Positive Logits
ประกอบ      0.07
откры       0.07
operator    0.06
ンの        0.06
ческая      0.06
теор        0.06
_sprite     0.06
‐           0.06
дити        0.06
/order      0.06
Activations Density 0.011%