INDEX
Explanations
The neuron activates on mentions of a “token” reward-and-penalty system: lines such as “You have 10 tokens to start,” “Each time you reject a question… 5 tokens will be deducted,” and similar token-count instructions.
Negative Logits
Album         -0.07
UR            -0.07
ott           -0.06
leaning       -0.06
poetic        -0.06
practicing    -0.06
tok           -0.06
Owens         -0.06
стра          -0.06
ircles        -0.06
Positive Logits
/init         0.07
_NAV          0.06
.wait         0.06
_DP           0.06
mocked        0.06
"Don          0.06
ayan          0.06
↵             0.06
publication   0.06
-option       0.06
Activations Density 0.002%