INDEX
Explanations
periods at the end of sentences
Neuron Alignment
Index   Value   % of L₁
1535    +0.32   1.1%
2034    +0.23   0.8%
674     +0.23   0.8%
Correlated Neurons
Index   P. Corr.   Cos Sim.
382     +0.32      0.08
1535    +0.23      0.07
752     +0.23      0.04
Negative Logits
increa      -2.35
disagre     -2.29
affor       -2.29
reluct      -2.28
depic       -2.26
unwarran    -2.21
maneu       -2.21
shenan      -2.18
viciss      -2.17
guarante    -2.16
Positive Logits
↵↵          1.35
↵↵↵         1.18
↵           1.10
↵↵↵↵        1.08
↵↵↵↵↵       1.02
<eos>       1.02
And         0.97
            0.96
But         0.92
</h1>       0.91
Activations Density 0.213%