INDEX
Explanations
mentions of specific locations
Neuron Alignment
Index   Value   % of L₁
50      +0.08   0.3%
324     +0.06   0.3%
689     +0.06   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
324     +0.08      0.04
1907    +0.06      0.04
306     +0.06      0.04
Negative Logits
<bos>      -1.12
do         -0.96
</tbody>   -0.94
.          -0.94
continue   -0.93
,          -0.93
have       -0.91
<eos>      -0.91
get        -0.90
in         -0.90
Positive Logits
maneu      2.94
increa     2.87
accla      2.84
emphat     2.81
affor      2.75
perfet     2.72
madonna    2.70
disagre    2.68
desir      2.65
inev       2.65
Activations Density 0.110%