Explanations
This neuron fires on tokens in the model’s response “No, the summary is not factually consistent with the document.” Specifically, it detects the explicit “No” negation and the surrounding phrasing that rejects consistency.
Negative Logits
Closure       -0.07
ERY           -0.07
ery           -0.06
sharp         -0.06
це            -0.06
Antarctica    -0.06
yı            -0.06
Mitar         -0.06
minions       -0.06
ゅ            -0.06
Positive Logits
bergen        0.07
рис           0.06
(dec          0.06
[blank]       0.06
,SIGNAL       0.06
[blank]       0.06
<div          0.06
(QWidget      0.06
.hl           0.06
footer        0.06
Activations Density 0.033%
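
The density figure can be read as the share of sampled tokens on which the neuron is active. The page does not define the metric, so the following is only a minimal sketch under that assumption. The function name `activation_density`, the zero threshold, and the sample values are all hypothetical and are not taken from the dashboard.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the percentage of activations strictly above `threshold`.

    Assumption: "density" means the fraction of tokens with a nonzero
    (positive) activation; the dashboard itself does not state this.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 3 active tokens out of 10,000 sampled tokens.
sample = [0.0] * 9997 + [0.4, 1.2, 0.7]
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under this reading, a density of 0.033% would mean the neuron is active on roughly 1 in 3,000 tokens, which fits a neuron tied to one specific response phrase.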