INDEX
Explanations
This neuron fires on tokens in the assistant’s informative, explanatory answer passages.
Negative Logits
650       -0.07
pup       -0.07
(common   -0.07
پرو       -0.06
接        -0.06
acos      -0.06
解        -0.06
otomy     -0.06
WA        -0.06
ِك        -0.06
Positive Logits
toHave         0.06
Vanity         0.06
=").           0.06
Highlighted    0.06
imageURL       0.06
petty          0.06
속             0.06
.unsubscribe   0.05
]")            0.05
repositories   0.05
Activations Density: 0.183%
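
The page does not define how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled tokens on which the neuron's activation is above zero, might look like the following. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and not taken from the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    neuron fires at all; the source page does not state its definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical per-token activations: 2 firing tokens out of 1000.
sample = [0.0] * 998 + [0.4, 1.2]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.183% would mean the neuron is active on roughly 18 of every 10,000 sampled tokens.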