INDEX
Explanations
phrases related to self-reflection and introspection
Neuron Alignment
Index   Value   % of L₁
776     +0.10   0.3%
50      +0.10   0.3%
845     +0.08   0.2%
Correlated Neurons
Index   P. Corr.   Cos Sim.
504     +0.10      0.04
1533    +0.10      0.01
845     +0.08      0.02
Negative Logits
maksi        -0.83
recev        -0.76
Keny         -0.76
timately     -0.75
vettoriale   -0.70
sopr         -0.68
évit         -0.66
azzurro      -0.66
seksi        -0.66
keramik      -0.66
Positive Logits
sort         0.50
atguigu      0.49
spesies      0.48
cassert      0.46
كتوبر        0.45
maybe        0.43
пиона        0.43
gelopen      0.43
hashlib      0.42
frase        0.42
Activations Density 0.309%