INDEX
Explanations
The neuron fires on apology expressions such as “I apologize” and “sorry,” signaling that the model is giving a regretful or apologetic response.
Negative Logits
“No         -0.07
코          -0.07
-sidebar    -0.07
brakk       -0.06
ku          -0.06
_PATH       -0.06
قام         -0.06
ourt        -0.06
541         -0.06
Damascus    -0.06
Positive Logits
奖           0.07
trajectory   0.07
_sel         0.07
Isn          0.06
cleanly      0.06
Ont          0.06
HTMLElement  0.06
shares       0.06
textStyle    0.06
NSS          0.06
Activations Density 0.005%
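
The page does not define how the density figure is computed. A minimal sketch, assuming it means the fraction of sampled token positions on which the neuron's activation is above zero; the function name, the threshold parameter, and the sample data are all hypothetical and used only for illustration:

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" is the share of sampled tokens on which the neuron
    is active. The source page does not state its exact definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the neuron is active on 1 of 20,000 token positions.
acts = [0.0] * 19_999 + [3.2]
density = activation_density(acts)
print(f"{density:.3%}")
```

Under that assumption, a density of 0.005% corresponds to roughly one active token in every 20,000 sampled.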