INDEX
Explanations
phrases related to self-reflection and introspection
Neuron Alignment
Index    Value    % of L₁
381      +0.14    0.4%
674      +0.13    0.4%
599      +0.11    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
1        +0.14       0.04
1282     +0.13       0.02
1650     +0.11       0.02
Negative Logits
Token        Logit
Juf          -1.79
stockholm    -1.77
dises        -1.69
lidl         -1.66
lyon         -1.63
wien         -1.61
leonardo     -1.61
squa         -1.59
frankfurt    -1.59
jorge        -1.58
Positive Logits
Token         Logit
<bos>         1.35
definitely    0.79
actually      0.72
really        0.71
my            0.70
pretty        0.70
probably      0.69
very          0.69
I             0.69
honestly      0.68
Activations Density 0.338%