Explanations
tags indicating categories or labels within the text
Neuron Alignment
Index    Value    % of L₁
156      +0.22    1.2%
93       +0.12    0.7%
148      +0.11    0.6%
Correlated Neurons
Index    P. Corr.    Cos Sim.
229      +0.22       0.02
93       +0.12       0.02
75       +0.11       0.02
Negative Logits
Token                 Logit
ij                    -3.96
ĸ´                    -3.60
ĥ½                    -3.56
ħ                     -3.51
(no visible token)    -3.47
(no visible token)    -3.47
↵↵                    -3.47
<|outofrange|>        -3.47
↵                     -3.47
(no visible token)    -3.47
Positive Logits
Token      Logit
gered      1.75
read       1.72
zilla      1.72
lia        1.71
lane       1.70
liament    1.68
gart       1.67
gun        1.64
alin       1.58
ied        1.54
Activations Density: 0.009%