Explanations
The main thing this neuron does is detect occurrences of the word “negative.”
Negative Logits
Token      Logit
нам        -0.08
بزر        -0.07
owntown    -0.06
moons      -0.06
Convert    -0.06
ezpe       -0.06
Fitness    -0.06
Wr         -0.06
badge      -0.06
"))))↵     -0.06
Positive Logits
Token      Logit
الل        0.07
hous       0.07
dispers    0.06
赢         0.06
">↵        0.06
Tight      0.06
conceive   0.06
音乐       0.06
borrowed   0.06
věci       0.06
Activations Density: 0.003%
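
The page does not define how the density figure is computed. As a rough illustration only, the sketch below assumes it is the share of sampled token positions on which the neuron's activation is above zero. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from the page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions where the neuron fires.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the neuron fires on 3 of 100,000 token positions.
acts = [0.0] * 100_000
for pos in (10, 500, 42_000):
    acts[pos] = 1.5

density = activation_density(acts)
print(f"{density:.3%}")  # prints "0.003%"
```

Under this reading, a density of 0.003% corresponds to about 3 active positions per 100,000 tokens, which would mark the neuron as very sparse.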