INDEX
Explanations
phrases expressing opinions or beliefs
Neuron Alignment
Index   Value   % of L₁
184     +0.18   0.6%
2034    +0.12   0.4%
872     +0.11   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
184     +0.18      0.02
605     +0.12      0.01
270     +0.11      0.03
Negative Logits
increa      -2.82
emphat      -2.75
fta         -2.71
guarante    -2.68
effe        -2.67
squa        -2.67
affor       -2.63
desir       -2.62
mef         -2.61
ftu         -2.61
Positive Logits
I        1.22
if       1.05
We       1.01
we       1.00
If       0.96
.        0.96
<eos>    0.94
if       0.94
I        0.94
[        0.93
Activations Density 0.101%