Explanations
The neuron fires on single-word affirmation or confirmation tokens (e.g. “Yes,” “true,” “Exactly”).
Negative Logits
Token            Logit
these            -0.06
disagreements    -0.06
튼               -0.06
Some             -0.06
linewidth        -0.06
Sala             -0.06
Senate           -0.06
"'.$             -0.06
Equipment        -0.06
spectro          -0.06
Positive Logits
Token            Logit
tidak            0.07
(if              0.07
hely             0.06
はい             0.06
tüm              0.06
ruining          0.06
Việc             0.06
Fehler           0.06
fringe           0.06
adress           0.06
Activations Density 0.028%