INDEX
Explanations
expressions of disagreement
Neuron Alignment
Index    Value    % of L₁
1325     +0.13    0.4%
765      +0.12    0.4%
411      +0.10    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1325     +0.13       0.02
276      +0.12       0.03
889      +0.10       0.02
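
The two columns in the Correlated Neurons table look like a Pearson correlation ("P. Corr.") and a cosine similarity ("Cos Sim."). As a minimal sketch of how two such scores differ, assuming the first is computed over paired activation values and the second over weight vectors (the source does not state either, and the function names below are hypothetical), the following uses only the Python standard library:

```python
import math

def pearson_corr(x, y):
    """Pearson correlation of two equal-length sequences (assumed: activations)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and standard deviations, both left unnormalised (the 1/n cancels).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors (assumed: weight directions)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Pearson correlation subtracts the mean before comparing, while cosine similarity does not, so the two can diverge for the same pair of vectors.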
Negative Logits
?...          -0.82
!...          -0.81
Kün           -0.79
emphat        -0.78
Simult        -0.77
impractica    -0.76
unlaw         -0.75
Fasc          -0.75
effe          -0.71
fuf           -0.69
Positive Logits
disagree        1.11
disagrees       0.91
agree           0.84
disagreed       0.82
disagreement    0.77
agrees          0.76
agreement       0.75
Agree           0.74
agree           0.69
Disagree        0.68
Activations Density 0.084%