Explanation: references to presentation slides
Neuron Alignment
Index   Value   % of L₁
77      +0.11   0.7%
235     +0.11   0.7%
376     +0.11   0.7%
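A minimal sketch of how the Neuron Alignment rows could be produced. The page does not define the columns, so this assumes that Value is a neuron's weight in the feature's decoder vector and that % of L₁ is that weight's absolute value divided by the L₁ norm of the whole vector. The helper `neuron_alignment` and the toy vector are hypothetical.

```python
def neuron_alignment(decoder_vec, k=3):
    """Hypothetical helper: the k neurons with the largest |weight| in a feature's
    decoder vector, returned as (index, weight, percent of the vector's L1 norm)."""
    l1 = sum(abs(w) for w in decoder_vec)
    # Rank neuron indices by absolute weight, largest first.
    top = sorted(range(len(decoder_vec)), key=lambda i: abs(decoder_vec[i]), reverse=True)[:k]
    return [(i, decoder_vec[i], 100.0 * abs(decoder_vec[i]) / l1) for i in top]


# Toy decoder vector over four neurons (illustrative values only).
for idx, val, pct in neuron_alignment([0.1, -0.4, 0.3, 0.2], k=2):
    print(f"{idx:<8}{val:+.2f}   {pct:.1f}%")
```

Under that reading, 0.7% means even the top-ranked neuron holds less than 1% of the vector's L₁ mass.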
Correlated Neurons
Index   Pearson Corr.   Cosine Sim.
26      +0.11           0.01
235     +0.11           0.01
449     +0.11           0.01
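A sketch of the two Correlated Neurons statistics, assuming (this page does not confirm it) that both compare a feature's per-token activations with a neuron's per-token activations over the same tokens. Pearson correlation is the cosine similarity of the mean-centered vectors, so the two columns can differ whenever the activations have nonzero means. The function names and sample data are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors (no centering).
    Raises ZeroDivisionError if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def pearson_correlation(x, y):
    """Pearson correlation: cosine similarity after subtracting each vector's mean.
    Undefined (ZeroDivisionError) if either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine_similarity([a - mx for a in x], [b - my for b in y])


# Hypothetical per-token activations for one feature and one neuron.
feature_acts = [0.0, 0.0, 2.0, 0.0, 1.5]
neuron_acts = [0.3, -0.1, 0.9, 0.2, 0.4]
print(round(pearson_correlation(feature_acts, neuron_acts), 3))
print(round(cosine_similarity(feature_acts, neuron_acts), 3))
```

Centering is the only difference: shifting a neuron's activations by a constant leaves its Pearson correlation unchanged but changes its cosine similarity.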
Negative Logits
§    -2.90
¦    -2.77
©    -2.68
Į    -2.61
ķ    -2.60
ĵ    -2.56
ij   -2.49
°    -2.48
¥    -2.47
Ħ    -2.38
Positive Logits
heet     2.27
mith     2.22
heets    2.17
ource    2.15
chool    2.00
ystem    1.94
pot      1.92
core     1.91
urface   1.88
ior      1.85
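A sketch of where the Negative Logits and Positive Logits lists could come from. It assumes (the page does not say) that each value is the feature's direct effect on a token's logit, computed as the dot product of the feature's decoder vector with that token's unembedding vector while ignoring any final layer norm. `logit_effects`, `top_logits`, and the toy unembedding are hypothetical; the ranking example reuses a few token values from the lists.

```python
def logit_effects(decoder_vec, unembed):
    """Hypothetical: direct logit effect per token, given a {token: unembedding vector} map."""
    return {tok: sum(d * u for d, u in zip(decoder_vec, vec)) for tok, vec in unembed.items()}


def top_logits(effects, k=10):
    """Split a {token: effect} map into the k most negative and k most positive entries."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by effect
    negative = ranked[:k]        # most negative first
    positive = ranked[::-1][:k]  # most positive first
    return negative, positive


# Toy two-dimensional unembedding (illustrative values only).
print(logit_effects([1.0, 2.0], {"a": [1.0, 0.0], "b": [0.0, -1.0]}))

effects = {"heet": 2.27, "mith": 2.22, "pot": 1.92, "§": -2.90, "¦": -2.77, "Ħ": -2.38}
neg, pos = top_logits(effects, k=2)
print(neg)
print(pos)
```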
Activation Density: 0.009%
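A sketch of the density figure, assuming it is the percentage of tokens in some evaluation set on which the feature's activation is above zero; the page gives neither the dataset nor the threshold. At 0.009%, that works out to about 9 tokens in every 100,000. The function name `activation_density` is hypothetical.

```python
def activation_density(feature_acts, threshold=0.0):
    """Hypothetical helper: percent of tokens whose activation exceeds threshold."""
    if not feature_acts:
        raise ValueError("need at least one activation")
    fired = sum(1 for a in feature_acts if a > threshold)
    return 100.0 * fired / len(feature_acts)


# 9 active tokens out of 100,000 matches the 0.009% density shown on this page.
acts = [0.0] * 99_991 + [0.5] * 9
print(f"{activation_density(acts):.3f}%")
```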