Explanations
scenarios or options that involve making difficult decisions and their potential consequences
Neuron Alignment
Index   Value   % of L₁
1173    +0.10   0.3%
674     +0.08   0.2%
623     +0.08   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
623     +0.10      0.03
1173    +0.08      0.03
301     +0.08      0.02
Negative Logits
increa      -1.80
emphat      -1.80
disagre     -1.79
accla       -1.76
guarante    -1.75
depic       -1.73
wherea      -1.71
inev        -1.68
encomp      -1.68
affor       -1.67
Positive Logits
option      1.34
option      1.19
Option      1.17
Option      1.10
options     1.05
OPTION      0.99
options     0.94
Options     0.91
choice      0.88
选项        0.85
Activations Density 0.515%