INDEX
Explanations
indications of negation and uncertainty in statements
Neuron Alignment
Index   Value   % of L₁
50      +0.22   0.8%
468     +0.12   0.4%
47      +0.10   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
468     +0.22      0.04
1292    +0.12      0.04
972     +0.10      0.05
Negative Logits
<bos>        -2.28
intersper    -1.09
disambigu    -0.94
intermitt    -0.86
unsus        -0.85
intrigu      -0.84
endow        -0.83
unspeak      -0.82
ineffec      -0.81
guil         -0.81
Positive Logits
signora      1.05
sorella      1.03
vacanza      0.98
paradiso     0.93
preghi       0.92
sfera        0.90
Muhamma      0.89
dott         0.89
muna         0.87
">/          0.86
Activations Density: 0.325%