INDEX
Explanations
negative sentiments or refusals
Neuron Alignment
Index    Value    % of L₁
1757     +0.14    0.5%
411      +0.14    0.4%
674      +0.11    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
411      +0.14       0.06
581      +0.14       0.06
1622     +0.11       0.04
Negative Logits
soigne      -0.96
habile      -0.94
Chapitre    -0.94
épu         -0.90
Intere      -0.85
Simult      -0.85
Confu       -0.85
hcm         -0.84
triomphe    -0.84
désol       -0.83
Positive Logits
autorytatywna      0.82
<bos>              0.69
wont               0.66
be                 0.64
Cyfeiriadau        0.63
necessarily        0.62
intptr             0.62
won                0.62
disambiguazione    0.60
Personensuche      0.59
Activations Density: 0.159%