INDEX
Explanations
sentences discussing varying perspectives on a specific issue
Neuron Alignment
Index   Value   % of L₁
2034    +0.22   0.7%
1535    +0.18   0.6%
382     +0.17   0.5%
Correlated Neurons
Index   P. Corr.   Cos Sim.
382     +0.22      0.09
1535    +0.18      0.06
827     +0.17      0.05
Negative Logits
impra     -2.57
maneu     -2.52
increa    -2.49
emphat    -2.46
affor     -2.45
milf      -2.42
hairc     -2.41
scrat     -2.41
suscep    -2.41
disagre   -2.40
Positive Logits
He      1.19
“       1.17
"       1.16
She     1.05
«       1.05
“       1.03
They    1.03
↵↵      0.98
”       0.98
„       0.97
Activations Density 0.270%