Explanations
comparisons of one thing being better than another
Neuron Alignment
Index    Value    % of L₁
1253     +0.09    0.3%
764      +0.08    0.2%
600      +0.08    0.2%
Correlated Neurons
Index    P. Corr.    Cos Sim.
600      +0.09       0.03
1519     +0.08       0.03
1664     +0.08       0.02
Negative Logits
emphat    -1.28
ftu       -1.22
increa    -1.20
aen       -1.19
Lég       -1.18
Juf       -1.18
„,        -1.17
lele      -1.16
fta       -1.15
meis      -1.14
Positive Logits
worse        0.63
sacrifice    0.54
マシ         0.54
atience      0.52
worst        0.52
Kč           0.51
than         0.51
risk         0.51
lieber       0.51
losing       0.50
Activations Density 0.270%