Explanations
Reasoning and explanations
inconsistencies between summaries and the original document's factual content.
This neuron fires on discourse markers and numerals in step-by-step explanations (for example, “Therefore” or numeric tokens) that mark logical or quantitative reasoning steps.
Negative Logits
Cs          -0.07
darauf      -0.07
распрост    -0.07
Related     -0.06
meshes      -0.06
LCD         -0.06
ären        -0.06
Band        -0.06
آم          -0.06
ales        -0.06

Positive Logits
possível    0.06
Grass       0.06
JI          0.06
�           0.06
INTER       0.06
ених        0.06
dynamic     0.06
pige        0.06
одав        0.06
lightly     0.06
Activations Density 0.057%
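
To make the density figure concrete, here is a minimal sketch of how such a number could be computed. It assumes that density means the fraction of tokens on which the neuron's activation exceeds a threshold; the page does not state its exact definition, and the function name `activation_density` and the `threshold` parameter are hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of tokens whose activation is strictly above `threshold`.

    Assumption: "density" is the share of active tokens. The dashboard
    does not document its own definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data: 57 active tokens out of 100,000 gives the same percentage
# as the figure shown on the page.
toy = [1.0] * 57 + [0.0] * (100_000 - 57)
density_pct = f"{activation_density(toy):.3%}"
```

Under this reading, a density of 0.057% corresponds to roughly 57 active tokens per 100,000 tokens sampled.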