INDEX
Explanations
phrases indicating negation or lack of certainty
Neuron Alignment
Index   Value   % of L₁
1622    +0.14   0.4%
537     +0.10   0.3%
1124    +0.09   0.3%
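
The "% of L₁" column is not defined on this page. One plausible reading, stated here as an assumption, is that it reports each neuron's absolute weight as a share of the L₁ norm of the feature's full weight vector. The sketch below illustrates that reading only; the function names `l1_share` and `top_aligned` are hypothetical, and ranking by absolute value is likewise assumed rather than confirmed by the dashboard.

```python
def l1_share(weights, index):
    """Return |weights[index]| divided by the L1 norm of weights.

    Assumed interpretation of the "% of L1" column; the dashboard
    itself does not state the formula.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("L1 norm is zero; share is undefined")
    return abs(weights[index]) / l1_norm


def top_aligned(weights, k=3):
    """Return (index, value, share) for the k largest-magnitude entries.

    Sorting by absolute value is an assumption made for illustration.
    """
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return [(i, weights[i], l1_share(weights, i)) for i in order[:k]]


if __name__ == "__main__":
    # Toy weight vector, not taken from the dashboard.
    toy = [0.5, -0.25, 0.125, 0.125]
    for idx, value, share in top_aligned(toy, k=2):
        print(f"{idx}  {value:+.2f}  {share:.1%}")
```

Under this reading, a value of +0.14 at 0.4% would imply an L₁ norm of roughly 35 for the full vector, which is broadly in line with the other two rows given rounding.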
Correlated Neurons
Index   P. Corr.   Cos Sim.
1622    +0.14      0.03
581     +0.10      0.03
1262    +0.09      0.02
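
The table above reports two different similarity measures per neuron. As a minimal sketch of how they differ, assuming "P. Corr." is a Pearson correlation and "Cos Sim." a cosine similarity between two equal-length vectors (the page does not say which vectors are compared), the following uses only the standard library. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation, computed as the cosine similarity of the
    mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)


if __name__ == "__main__":
    # Toy vectors, not taken from the dashboard.
    u = [1.0, 2.0, 3.0]
    v = [3.0, 2.0, 1.0]
    print(f"pearson = {pearson_correlation(u, v):+.2f}")
    print(f"cosine  = {cosine_similarity(u, v):.2f}")
```

Because Pearson correlation subtracts the mean before comparing, the two measures can disagree substantially on the same pair of vectors, which is one reason a dashboard might list both.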
Negative Logits
lele      -0.88
fua       -0.82
Ibidem    -0.81
hina      -0.80
Simult    -0.80
pama      -0.77
Membre    -0.77
Amerik    -0.76
kram      -0.76
kasa      -0.75
Positive Logits
necessarily      0.74
may              0.68
siquiera         0.59
be               0.58
principalTable   0.58
might            0.58
<bos>            0.56
may              0.55
not              0.55
necessarily      0.52
Activations Density: 0.080%