INDEX
Explanations
references to collective action and responsibility
Neuron Alignment
Index    Value    % of L₁
50       +0.41    1.5%
1919     +0.11    0.4%
1510     +0.09    0.3%
Correlated Neurons
Index    Pearson Corr.    Cos Sim.
1919     +0.41            0.18
1415     +0.11            0.10
1510     +0.09            0.07
Negative Logits
<bos>    -2.25
gend     -0.86
adal     -0.81
vang     -0.80
gie      -0.79
glan     -0.78
frans    -0.78
ù        -0.77
puc      -0.75
hej      -0.72
Positive Logits
should        1.03
shouldn       0.98
soulign       0.96
Should        0.93
Shouldn       0.93
véhic         0.93
tupperware    0.92
ought         0.89
Should        0.87
need          0.86
Activations Density: 0.821%