INDEX
Explanations
phrases emphasizing the significance of certain concepts
Neuron Alignment
Index   Value   % of L₁
376     +0.15   0.9%
434     +0.13   0.8%
407     +0.12   0.7%
Correlated Neurons
Index   P. Corr.   Cos Sim.
65      +0.15      0.05
407     +0.13      0.04
263     +0.12      0.04
Negative Logits
ĥ½        -1.79
Ĥ         -1.78
athing    -1.64
·         -1.63
Ń         -1.61
ı         -1.61
rency     -1.59
Cities    -1.59
¢         -1.53
ĥ         -1.49
Positive Logits
nel       2.02
aliana    1.88
iop       1.74
nell      1.70
binding   1.56
iom       1.56
meal      1.55
ologic    1.49
avis      1.47
ograph    1.46
Activations Density 0.028%