INDEX
Explanations
phrases indicating thought and reflection on past actions
Neuron Alignment
Index   Value   % of L₁
50      +0.14   0.5%
344     +0.10   0.4%
845     +0.08   0.3%
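The source does not define the "% of L₁" column. One reading that fits the listed numbers is each neuron's absolute value divided by the L₁ norm of the whole weight vector (0.14 / 0.005 implies an L₁ norm of about 28). The sketch below illustrates that reading only. The function name `l1_share` and the toy vector are hypothetical, not taken from the dashboard.

```python
def l1_share(weights, top_k=3):
    """Return (index, value, share_of_l1) for the top_k largest-magnitude entries.

    Assumption: "% of L1" means |value| / sum(|all values|).
    """
    l1 = sum(abs(w) for w in weights)
    if l1 == 0:
        return []  # an all-zero vector has no meaningful shares
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return [(i, weights[i], abs(weights[i]) / l1) for i in ranked[:top_k]]


# Toy vector with made-up values (not the feature's real weights).
toy = [0.5, -0.25, 0.125, 0.125]
rows = l1_share(toy, top_k=2)
for index, value, share in rows:
    print(f"{index}\t{value:+.2f}\t{share:.1%}")
```

Under this reading, small percentages like those in the table would mean the vector's mass is spread across many neurons, not concentrated in a few.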
Correlated Neurons
Index   P. Corr.   Cos Sim.
381     +0.14      0.03
76      +0.10      0.05
1445    +0.08      0.07
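"P. Corr." and "Cos Sim." presumably stand for Pearson correlation and cosine similarity. The source does not say which vectors are compared, so the sketch below only shows how the two measures differ on generic lists of numbers. Pearson correlation is cosine similarity applied after subtracting each vector's mean, which is why the two columns can disagree. The function names and inputs are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (fails on an all-zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


# Made-up example: orthogonal vectors that are perfectly anti-correlated.
a = [1.0, 0.0]
b = [0.0, 1.0]
cos_ab = cosine_similarity(a, b)
corr_ab = pearson_correlation(a, b)
print(cos_ab, corr_ab)
```

In the toy example the cosine similarity is zero while the correlation is close to -1, so a low "Cos Sim." next to a higher "P. Corr." is not contradictory in itself.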
Negative Logits
<bos>         -2.00
/**           -1.06
ⓧ             -1.02
effectually   -0.90
              -0.88
<?            -0.88
forbear       -0.86
quitted       -0.85
gratify       -0.82
<?            -0.82
Positive Logits
vasi      0.89
tyn       0.87
asfal     0.85
ananas    0.84
ortop     0.84
alpes     0.83
marte     0.82
torba     0.81
Ferdin    0.78
antropo   0.77
Activations Density 0.913%