Explanations
references to specific video games and movie titles
Neuron Alignment
Index   Value   % of L₁
674     +0.34   1.2%
1150    +0.17   0.6%
184     +0.15   0.5%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1438    +0.34      0.03
184     +0.17      0.02
284     +0.15      0.04
Negative Logits
reluct    -3.01
increa    -2.99
inev      -2.93
affor     -2.88
fuf       -2.86
depic     -2.83
disagre   -2.81
unden     -2.80
volunte   -2.79
secon     -2.75
Positive Logits
<bos>             1.54
.                 1.07
。                0.96
.                 0.93
).                0.91
;                 0.90
."                0.89
RectangleBorder   0.88
!                 0.88
.”                0.88
Activations Density 0.157%