INDEX
Explanations
affirmative answers with explanations
The neuron detects strong affirmative or positive-response tokens, i.e., cases where the model asserts agreement or labels content as positive.
Negative Logits
Token      Value
Answer     0.56
answered   0.56
ANSWER     0.54
No         0.52
YES        0.51
answer     0.51
yes        0.50
Yes        0.50
yes        0.50
answer     0.47
Positive Logits
Token      Value
契         0.47
而且       0.46
ське       0.43
сан        0.43
齬         0.43
ELEASE     0.42
。.        0.42
Story      0.42
そして     0.42
révol      0.42
Activations Density 0.069%
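
The density figure above is not defined on this page. One common reading, assumed here, is the percentage of tokens on which the neuron has a non-zero activation. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the neuron fires.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: 10,000 tokens, 7 of which activate the neuron.
sample = [0.0] * 9993 + [1.2, 0.4, 3.1, 0.9, 2.2, 0.7, 1.5]
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under this reading, a density of 0.069% would mean the neuron fires on roughly 7 of every 10,000 tokens, which is typical of a sparse, selective unit.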