Explanations
This neuron specializes in detecting the token “No” in the assistant’s responses.
Negative Logits
put             -0.06
گرد             -0.06
.dir            -0.06
.COM            -0.06
explanations    -0.06
тим             -0.06
Ли              -0.06
fi              -0.06
_PUT            -0.06
Stefan          -0.06
Positive Logits
stimulate       0.07
ốc              0.07
getInt          0.06
_neurons        0.06
expanded        0.06
toDouble        0.06
vigorously      0.06
haar            0.06
ayında          0.06
acent           0.06
Activations Density: 0.010%
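
For a sense of scale, the sketch below shows one way a density figure like this could be computed. It assumes that density means the fraction of token positions on which the neuron's activation is above zero; the page does not state its exact definition, so treat that as an assumption. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" counts positions where the neuron fires at all.
    The dashboard's actual definition may differ.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: 10,000 token positions, exactly one of which fires.
sample = [0.0] * 9_999 + [3.2]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.010%
```

Under that assumption, a density of 0.010% corresponds to the neuron firing on roughly one token in every 10,000.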