INDEX
Explanations
Code-related text
This neuron fires on the explicit answer-choice labels “positive” and “negative.”
Negative Logits
Token           Logit
.Mouse          -0.08
iagnostics      -0.07
ROM             -0.07
علام            -0.06
==========      -0.06
(Constructor    -0.06
_write          -0.06
khoa            -0.06
_iso            -0.06
COD             -0.06
Positive Logits
Token           Logit
Clara           0.06
learned         0.06
rewarded        0.06
ar              0.06
PUT             0.06
Inches          0.06
reported        0.06
doubts          0.05
addiction       0.05
Sim             0.05
Activations Density 0.007%