INDEX
Explanations
references to the number of pages
Neuron Alignment
Index    Value    % of L₁
466      +0.12    0.8%
355      +0.11    0.7%
241      +0.11    0.7%
Correlated Neurons
Index    P. Corr.    Cos Sim.
355      +0.12       0.01
466      +0.11       0.01
255      +0.11       0.01
Negative Logits
rians        -2.18
trim         -1.89
rian         -1.85
omitempty    -1.78
silence      -1.72
matically    -1.72
'</          -1.69
yours        -1.69
fair         -1.69
harmless     -1.61
Positive Logits
helf      2.54
ystems    1.99
chaft     1.92
ugu       1.84
fel       1.83
ist       1.82
cule      1.80
ource     1.78
mith      1.77
fors      1.75
Activations Density: 0.011%