Explanations
This neuron fires on system-style or meta instructions, e.g. the word “revert” and accompanying formatting tokens such as “with” and quotation marks.
Negative Logits
Token           Logit
)");↵↵          -0.07
doctoral        -0.07
オン            -0.07
oltage          -0.06
Katz            -0.06
circumcision    -0.06
NA              -0.06
traf            -0.06
.staff          -0.06
"):↵            -0.06
Positive Logits
Token           Logit
revert          0.11
reverted        0.10
ocu             0.07
verting         0.07
чивается        0.07
rever           0.07
going           0.07
persisted       0.07
:white          0.07
abide           0.07
Activations Density: 0.002%
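As a rough illustration of what a density figure like this can mean, the sketch below computes the percentage of sampled tokens on which a neuron's activation exceeds a threshold. It assumes density is defined as "fraction of tokens with activation above zero", which the page itself does not state. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: density means "share of tokens on which the neuron fires";
    the page does not spell out its exact definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: the neuron fires on 2 of 100,000 sampled tokens.
sample = [0.0] * 99_998 + [1.3, 0.4]
print(f"{activation_density(sample):.3f}%")  # prints "0.002%"
```

Under this assumed definition, a density of 0.002% corresponds to roughly two active tokens per 100,000 sampled, which would make this a very sparse neuron.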