Explanations
proper nouns such as names of people, places, organizations, and titles
Neuron Alignment
Index   Value   % of L₁
1741    +0.30   1.0%
50      +0.26   0.9%
2019    +0.17   0.6%
Correlated Neurons
Index   P. Corr.   Cos Sim.
16      +0.30      0.10
50      +0.26      0.07
1288    +0.17      0.07
Negative Logits
Token      Logit
yoda       -0.93
pixar      -0.84
soeur      -0.81
monstre    -0.81
pikachu    -0.80
gardien    -0.78
😭😭       -0.76
Mère       -0.76
broderie   -0.75
bieber     -0.74
Positive Logits
Token      Logit
makro      0.78
Pä         0.74
Fö         0.73
Nö         0.72
ideolog    0.72
saar       0.70
Fakta      0.69
Jä         0.69
alkoh      0.69
Schrö      0.68
Activations Density: 0.416%