INDEX
Explanations
words related to political parties and historical figures
Neuron Alignment
Index   Value   % of L₁
50      +0.28   1.0%
1967    +0.16   0.6%
394     +0.12   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
227     +0.28      0.12
143     +0.16      0.09
394     +0.12      0.08
Negative Logits
affez     -0.86
rispond   -0.79
<bos>     -0.76
riemp     -0.75
allarg    -0.73
isolato   -0.72
<?        -0.71
🕗        -0.70
cammin    -0.69
dicono    -0.68
Positive Logits
strto     0.51
Koning    0.49
-         0.47
ity       0.47
ism       0.46
ous       0.45
("")      0.45
'         0.44
(:)       0.43
’         0.43
Activations Density 0.927%