INDEX
Explanations
phrases indicating uncertainty or comparison between multiple elements
Neuron Alignment
Index    Value    % of L₁
50       +0.22    0.7%
678      +0.09    0.3%
108      +0.09    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
50       +0.22       0.12
1959     +0.09       0.12
394      +0.09       0.08
Negative Logits
tré          -0.58
trá          -0.55
sér          -0.54
<s>          -0.54
sexu         -0.53
Horizonte    -0.53
Wikimédia    -0.52
EOR          -0.52
urie         -0.49
vains        -0.49
Positive Logits
unspeak      1.35
indestru     1.34
shenan       1.28
reluct       1.27
unwarran     1.26
disagre      1.25
unlaw        1.20
uninten      1.20
ftu          1.17
impra        1.16
Activations Density 2.163%