INDEX
Explanations
phrases related to accountability and taking responsibility
Neuron Alignment
Index   Value   % of L₁
420     +0.15   0.6%
410     +0.15   0.5%
397     +0.14   0.5%
Correlated Neurons
Index   P. Corr.   Cos Sim.
420     +0.15      0.04
410     +0.15      0.03
976     +0.14      0.03
Negative Logits
Strukt    -0.55
*         -0.55
saar      -0.53
robus     -0.51
maksi     -0.51
Mä        -0.49
conflic   -0.49
valla     -0.48
kado      -0.48
kram      -0.47
Positive Logits
responsibility   1.23
responsible      1.22
Responsible      1.18
Responsibility   1.12
Responsible      1.09
responsibility   1.09
responsible      1.09
respon           0.96
Responsibility   0.94
RESPONS          0.94
Activations Density: 0.060%