INDEX
Explanations
URLs or web links in the text
Neuron Alignment
Index   Value   % of L₁
284     +0.11   0.6%
358     +0.11   0.6%
455     +0.10   0.6%
Correlated Neurons
Index   P. Corr.   Cos Sim.
28      +0.11      0.02
336     +0.11      0.02
284     +0.10      0.02
Negative Logits
ellow      -1.52
rait       -1.42
}</        -1.41
noreply    -1.37
ired       -1.36
iÄĩ        -1.30
cross      -1.30
esque      -1.29
mes        -1.25
ousse      -1.24
Positive Logits
doibase      1.46
license      1.41
discretion   1.33
lic          1.32
yourselves   1.32
pntd         1.31
rieved       1.31
CLAIM        1.30
://          1.28
forge        1.27
Activations Density: 0.038%