INDEX
Explanations
statements expressing value judgments or opinions on what actions should or should not take place
Neuron Alignment
Index    Value    % of L₁
1124     +0.12    0.4%
900      +0.12    0.3%
674      +0.11    0.3%
Correlated Neurons
Index    Pearson Corr.    Cos. Sim.
1124     +0.12            0.04
208      +0.12            0.04
900      +0.11            0.04
Negative Logits
apprehen     -1.07
intersper    -1.03
unspeak      -1.00
emphat       -0.95
pamph        -0.93
accla        -0.91
vainly       -0.90
maneu        -0.90
inconce      -0.89
intrigu      -0.89
Positive Logits
should       0.64
should       0.63
noten        0.59
shouldn      0.59
Should       0.57
SHOULD       0.54
be           0.53
Should       0.52
'\\;'        0.51
oplayer      0.49
Activations Density 0.138%