INDEX
Explanations
greetings and welcoming phrases
Neuron Alignment
Index   Value   % of L₁
156     +0.22   1.3%
263     +0.15   0.9%
302     +0.13   0.7%
Correlated Neurons
Index   P. Corr.   Cos Sim.
156     +0.22      0.42
111     +0.15      0.41
71      +0.13      0.40
Negative Logits
ħ    -1.74
¯    -1.67
ŀ    -1.63
Ļ    -1.58
ĥ    -1.57
¾    -1.53
ĸ    -1.52
Ŀ    -1.51
&=   -1.50
ģ    -1.49
Positive Logits
chat        1.82
congrat     1.80
yours       1.78
welcome     1.68
joking      1.66
Welcome     1.61
my          1.59
questions   1.57
comments    1.56
hello       1.55
Activations Density 1.609%