INDEX
Explanations
information related to conflicts and power dynamics
Neuron Alignment
Index   Value   % of L₁
426     +0.10   0.3%
1243    +0.09   0.3%
2041    +0.08   0.2%
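The alignment table can be reproduced with a short sketch. The card does not define its columns, so this assumes that Value is an entry of the feature's weight vector expressed in neuron space and that % of L₁ is that entry's magnitude divided by the vector's L₁ norm. The function name `neuron_alignment` and its input are hypothetical.

```python
def neuron_alignment(weights, k=3):
    """Hypothetical helper: top-k neurons for one feature.

    Assumes `weights[i]` is the feature's weight on neuron i. Returns
    (index, value, share of L1 norm) for the k largest values, sorted
    in descending order to match the table above.
    """
    l1 = sum(abs(w) for w in weights)  # L1 norm of the whole weight vector
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    return [(i, weights[i], abs(weights[i]) / l1) for i in top]
```

Under this reading, a 0.3% share means the top neuron carries only a tiny fraction of the vector's L₁ mass, i.e. the feature is spread across many neurons rather than aligned with any single one.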
Correlated Neurons
Index   P. Corr.   Cos Sim.
1243    +0.10      0.04
426     +0.09      0.03
1109    +0.08      0.03
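A minimal sketch of the two correlation columns, assuming P. Corr. is the Pearson correlation and that both columns compare the feature's activations with a neuron's activations over the same tokens (the card does not say which vectors are compared). Pearson correlation centers each series before comparing; cosine similarity does not, so the two can differ for sparse, mostly-zero activations. All names here are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series (raises ZeroDivisionError if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cosine(x, y):
    """Cosine similarity of two uncentered vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def correlated_neurons(feature_acts, neuron_acts, k=3):
    """Hypothetical helper: rank neurons by Pearson correlation with the feature.

    `neuron_acts` maps a neuron index to its activations on the same tokens
    as `feature_acts`. Returns (index, Pearson, cosine) for the top k.
    """
    scored = [(idx, pearson(feature_acts, acts), cosine(feature_acts, acts))
              for idx, acts in neuron_acts.items()]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:k]
```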
Negative Logits
emphat     -0.98
impra      -0.98
Juf        -0.97
encomp     -0.97
reluct     -0.96
inext      -0.94
indestru   -0.93
inappro    -0.92
maneu      -0.90
increa     -0.90
Positive Logits
ratio       1.08
ratios      1.02
imbalance   0.94
balance     0.92
ratio       0.89
balanced    0.84
balance     0.83
Ratio       0.82
balances    0.82
Ratio       0.80
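Both logit lists can be sketched under the assumption that each value is the direct effect of the feature's decoder direction on a token's logit, i.e. its dot product with that token's unembedding vector, ignoring any final normalization. The inputs `decoder_dir`, `unembed`, and `vocab` are hypothetical. Repeated entries such as "ratio" and "balance" above would then be distinct vocabulary items that render identically (for example, with and without a leading space).

```python
def logit_effects(decoder_dir, unembed, vocab, k=10):
    """Hypothetical helper: most negative and most positive direct logit effects.

    Assumes `unembed[t]` is the unembedding vector of token `vocab[t]`.
    Returns (negative, positive): the k lowest scores in ascending order
    and the k highest scores in descending order.
    """
    scores = [(tok, sum(d * u for d, u in zip(decoder_dir, unembed[t])))
              for t, tok in enumerate(vocab)]
    scores.sort(key=lambda s: s[1])  # ascending by logit effect
    return scores[:k], scores[::-1][:k]
```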
Activations Density 0.313%
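Activation density is assumed here to be the fraction of tokens on which the feature's activation is strictly positive; under that reading, 0.313% means the feature fires on roughly 3 tokens in every 1,000. A hypothetical sketch:

```python
def activation_density(acts):
    """Hypothetical helper: fraction of tokens where the feature is active (> 0).

    Raises ZeroDivisionError for an empty list.
    """
    return sum(1 for a in acts if a > 0) / len(acts)
```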