INDEX
Explanations
technical instructions or steps for completing a task
Neuron Alignment
Index  Value  % of L₁
876    +0.20  0.7%
184    +0.16  0.6%
674    +0.13  0.5%
Correlated Neurons
Index  P. Corr.  Cos Sim.
678    +0.20     0.05
876    +0.16     -0.01
468    +0.13     0.03
Negative Logits
.        -0.72
↵↵       -0.67
<eos>    -0.67
despite  -0.67
..       -0.66
So       -0.65
She      -0.65
It       -0.65
My       -0.64
The      -0.64
Positive Logits
milano     1.88
affez      1.86
ftu        1.86
napoli     1.85
swarovski  1.83
desir      1.80
fluo       1.78
erec       1.78
!...       1.77
canel      1.77
Activations Density 0.291%