INDEX
Explanations
dialogue interaction involving scolding or reprimanding someone
Neuron Alignment
Index    Value    % of L₁
2019     +0.15    0.5%
1535     +0.15    0.5%
382      +0.14    0.4%
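The "% of L₁" column is not defined on this page. A minimal sketch of one plausible reading, assuming it is each neuron's absolute weight divided by the L₁ norm (sum of absolute values) of the whole feature vector. The function name `l1_share` and the toy vector are hypothetical and are not taken from the dashboard. If that reading is right, a value of 0.15 shown as 0.5% would imply an L₁ norm of roughly 30.

```python
def l1_share(weights):
    """Return each weight's absolute value as a percentage of the L1 norm.

    Assumption: '% of L1' means |w_i| / sum_j |w_j|; the page does not say so.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("L1 norm is zero; shares are undefined")
    return [100.0 * abs(w) / l1_norm for w in weights]


# Hypothetical toy vector, not data from the dashboard.
toy_weights = [0.5, -0.25, 0.25]
shares = l1_share(toy_weights)
print(shares)
```

Taking absolute values means the shares always sum to 100%, regardless of the sign of each weight.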
Correlated Neurons
Index    P. Corr.    Cos Sim.
382      +0.15       0.07
509      +0.15       0.06
1533     +0.14       0.03
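The two columns above can differ for the same neuron. A minimal sketch of why, assuming "P. Corr." is the Pearson correlation and "Cos Sim." is the cosine similarity of two activation series (the page does not spell either out). Pearson correlation is the cosine similarity of the mean-centered series, so the two agree only when both means are zero. The function names and toy data are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def pearson_correlation(x, y):
    """Pearson correlation: cosine similarity after subtracting each mean."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical toy activations, not data from the dashboard.
feature_acts = [1.0, 2.0, 3.0]
neuron_acts = [3.0, 2.0, 1.0]

pearson = pearson_correlation(feature_acts, neuron_acts)
cosine = cosine_similarity(feature_acts, neuron_acts)
print(pearson, cosine)
```

In this toy case the series are perfectly anti-correlated, yet the cosine similarity is positive because both series have a large positive mean.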
Negative Logits
Token       Logit
guarante    -2.43
increa      -2.40
affor       -2.37
emphat      -2.32
encomp      -2.28
maneu       -2.22
reluct      -2.22
strick      -2.21
inev        -2.20
disagre     -2.19
Positive Logits
Token        Logit
<eos>        0.96
She          0.87
He           0.87
But          0.83
↵↵           0.81
he           0.80
So           0.80
she          0.79
XmlSchema    0.78
↵            0.77
Activations Density 0.286%