Explanations
The neuron activates on self-referential stance phrases where the author expresses a goal of being positive or unbiased in their writing.
Negative Logits
_formatter   -0.07
_ps          -0.07
rı           -0.06
K            -0.06
граф         -0.06
credible     -0.06
縮           -0.06
guarantee    -0.06
Upgrade      -0.06
缩           -0.06
Positive Logits
lder         0.07
ливість      0.06
(tokens      0.06
.validators  0.06
,String      0.06
.csv         0.06
abbrev       0.06
.Css         0.06
("""         0.06
Talking      0.06
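The page does not say how these logit values were obtained. A common approach, assumed here, is direct logit attribution: multiply the neuron's output weight vector by the model's unembedding matrix and rank every vocabulary token by the result. The sketch below shows only that ranking step; the function name `logit_effects` and the toy matrices are hypothetical, and the listed values above appear to be rescaled, so this is not a reproduction of them.

```python
from typing import List, Sequence, Tuple

Ranked = List[Tuple[str, float]]


def logit_effects(
    direction: Sequence[float],
    unembed: Sequence[Sequence[float]],  # assumed shape: [d_model][vocab_size]
    vocab: Sequence[str],
    k: int = 10,
) -> Tuple[Ranked, Ranked]:
    """Hypothetical helper: project a neuron's output direction onto the vocabulary.

    Returns (negative, positive): the k tokens whose logits the direction
    lowers most and the k tokens whose logits it raises most.
    """
    d_model = len(direction)
    vocab_size = len(vocab)
    if len(unembed) != d_model or any(len(row) != vocab_size for row in unembed):
        raise ValueError("unembed must have shape [len(direction)][len(vocab)]")

    # logits[t] = sum over i of direction[i] * unembed[i][t]
    logits = [
        sum(direction[i] * unembed[i][t] for i in range(d_model))
        for t in range(vocab_size)
    ]

    # Sort ascending: most negative contributions first.
    ranked = sorted(zip(vocab, logits), key=lambda pair: pair[1])
    negative = ranked[:k]
    positive = ranked[::-1][:k]
    return negative, positive


if __name__ == "__main__":
    # Toy example with d_model = 2 and a three-token vocabulary.
    neg, pos = logit_effects(
        [1.0, -1.0],
        [[0.5, 0.1, -0.2], [0.1, 0.4, 0.3]],
        ["a", "b", "c"],
        k=1,
    )
    print("negative:", neg)
    print("positive:", pos)
```

In the toy example, token `c` receives the most negative contribution and token `a` the most positive one. For a real model, the direction would be the neuron's row of the MLP output weights and `unembed` the model's unembedding matrix, both assumptions about how the page's lists were built.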
Activation Density: 0.068%
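Activation density is presumably the percentage of tokens in the evaluation data on which the neuron fires, so 0.068% would mean roughly 68 firing tokens per 100,000. The sketch below assumes "fires" means an activation strictly above zero; the page states neither the threshold nor the dataset, and `activation_density` is a hypothetical helper.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Hypothetical helper: percentage of tokens whose activation exceeds `threshold`.

    The default threshold of 0.0 is an assumption, not taken from the source page.
    """
    total = 0
    active = 0
    for value in activations:
        total += 1
        if value > threshold:
            active += 1
    if total == 0:
        raise ValueError("no activations supplied")
    return 100.0 * active / total


if __name__ == "__main__":
    # 68 firing tokens out of 100,000 corresponds to a density of 0.068%.
    sample = [1.0] * 68 + [0.0] * 99_932
    print(f"{activation_density(sample):.3f}%")
```

Because the result is a percentage of all tokens seen, a density this low indicates a very sparse neuron, consistent with the narrow stance-phrase explanation above.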