INDEX
Explanations
references to tables or data structures
Neuron Alignment
Index   Value   % of L₁
243     +0.15   0.8%
224     +0.13   0.7%
357     +0.12   0.7%
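The listing does not define the "% of L₁" column. One plausible reading, sketched below, is that it gives each neuron's absolute weight as a share of the L₁ norm of the whole feature vector. The function name `l1_shares` and the toy vector are hypothetical and are not taken from the dashboard.

```python
import numpy as np

def l1_shares(direction, top_k=3):
    """Return (index, value, percent_of_l1) for the top_k largest-magnitude entries.

    Assumption: "% of L1" means |value| / sum(|all values|) * 100.
    """
    direction = np.asarray(direction, dtype=float)
    l1_norm = np.abs(direction).sum()
    if l1_norm == 0:
        raise ValueError("direction has zero L1 norm")
    # Stable sort so that ties keep their original index order.
    order = np.argsort(-np.abs(direction), kind="stable")[:top_k]
    return [
        (int(i), float(direction[i]), float(abs(direction[i]) / l1_norm * 100.0))
        for i in order
    ]

# Toy vector, not the real feature: the L1 norm is 1.0.
rows = l1_shares([0.5, -0.25, 0.25], top_k=2)
```

Under this reading, a value of +0.15 at 0.8% would imply an L₁ norm of roughly 19 for the full vector, which suggests the weight is spread thinly over many neurons.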
Correlated Neurons
Index   P. Corr.   Cos Sim.
357     +0.15      0.01
79      +0.13      0.01
224     +0.12      0.01
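The two columns above appear to be different measures: "P. Corr." is presumably the Pearson correlation between the feature's activations and a neuron's activations over a sample of tokens, while "Cos Sim." is presumably the cosine similarity between two weight vectors. A minimal sketch of both, assuming that interpretation; the function names and sample inputs are hypothetical.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation of two equal-length activation series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre each series first
    yc = y - y.mean()
    denom = np.linalg.norm(xc) * np.linalg.norm(yc)
    if denom == 0:
        raise ValueError("one of the series is constant")
    return float(xc @ yc / denom)

def cosine_sim(u, v):
    """Cosine similarity of two vectors (no centring)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        raise ValueError("zero vector")
    return float(u @ v / denom)

# Toy inputs only: perfectly correlated series, orthogonal vectors.
r = pearson_corr([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
c = cosine_sim([1.0, 0.0], [0.0, 1.0])
```

Because the two measures are computed on different objects, a neuron can be weakly correlated with the feature (+0.15) while its weight vector is nearly orthogonal to it (0.01).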
Negative Logits
charge     -1.79
ness       -1.72
harmless   -1.67
Ļª         -1.66
«          -1.66
Ń          -1.62
ĸ          -1.61
¬          -1.55
½          -1.55
¥          -1.54
Positive Logits
au       1.94
bed      1.89
top      1.86
Tenn     1.86
mania    1.83
tennis   1.81
cloth    1.78
backs    1.61
Rum      1.61
rett     1.59
Activations Density 0.057%
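The density figure is not defined in the listing. A common reading is the percentage of sampled tokens on which the feature's activation is nonzero; the sketch below assumes that definition. The function name `activation_density` and the `threshold` parameter are hypothetical.

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which
    the feature fires at all.
    """
    acts = np.asarray(activations, dtype=float)
    if acts.size == 0:
        raise ValueError("no activations supplied")
    return float((acts > threshold).mean() * 100.0)

# Toy sample: the feature fires on 1 of 4 tokens.
density = activation_density([0.0, 0.0, 0.5, 0.0])
```

On that definition, 0.057% corresponds to the feature firing on roughly 1 token in 1,750.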