INDEX
Explanations
phrases related to readiness or willingness to undertake particular actions
Neuron Alignment
Index    Value    % of L₁
1892     +0.13    0.4%
1310     +0.12    0.4%
411      +0.10    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1310     +0.13       0.03
1892     +0.12       0.03
411      +0.10       0.02
Negative Logits
inder        -0.99
emphat       -0.94
zyn          -0.92
effe         -0.88
pessi        -0.88
fundament    -0.87
abnorm       -0.87
ert          -0.87
kram         -0.86
aen          -0.85
Positive Logits
willing        1.22
willing        1.18
willingness    1.05
Willing        1.04
Willing        0.96
unwilling      0.78
skimage        0.62
willingly      0.62
reluctant      0.58
bereit         0.55
Activations Density: 0.048%