INDEX
Explanations
references to uncovering hidden information or secrets
Neuron Alignment
Index  Value  % of L₁
314    +0.10  0.3%
674    +0.09  0.3%
321    +0.09  0.2%
Correlated Neurons
Index  P. Corr.  Cos Sim.
792    +0.10     0.04
655    +0.09     0.05
581    +0.09     0.04
Negative Logits
défend      -0.56
AfterEach   -0.50
smtplib     -0.50
reconnaît   -0.50
Aftermath   -0.45
entraîne    -0.45
Tanjung     -0.44
Muhamma     -0.44
dépasse     -0.44
accompagne  -0.44
Positive Logits
sappi     0.93
<bos>     0.89
sembrano  0.80
parlano   0.80
scopri    0.79
morire    0.78
vogli     0.78
anse      0.75
abbandon  0.71
torner    0.71
Activations Density 0.306%