Explanations
references to barriers or obstacles
Neuron Alignment
Index    Value    % of L₁
180      +0.13    0.8%
475      +0.12    0.7%
348      +0.11    0.6%
Correlated Neurons
Index    P. Corr.    Cos Sim.
475      +0.13       0.01
180      +0.12       0.01
384      +0.11       0.01
Negative Logits
'))            -1.67
urred          -1.66
hentication    -1.65
'));           -1.57
xiety          -1.52
...](          -1.51
omorphisms     -1.49
agogue         -1.48
acity          -1.47
\\             -1.47
Positive Logits
ström          2.03
bilt           1.96
ista           1.95
zilla          1.85
gren           1.82
iang           1.79
agem           1.76
istas          1.75
chaft          1.74
iative         1.73
Activations Density 0.050%