INDEX
Explanations
phrases that introduce research findings or reports
Neuron Alignment
Index   Value   % of L₁
466     +0.13   0.7%
321     +0.12   0.6%
478     +0.11   0.6%
Correlated Neurons
Index   P. Corr.   Cos Sim.
238     +0.13      0.04
404     +0.12      0.04
176     +0.11      0.04
Negative Logits
Token         Logit
gs            -1.54
gio           -1.44
(.            -1.37
thing         -1.34
hler          -1.33
himself       -1.31
resuspended   -1.26
acer          -1.26
ersion        -1.26
bits          -1.25
Positive Logits
Token             Logit
ĻĤ                3.03
ĥ½                2.84
↵                 2.82
↵↵                2.82
↵                 2.82
<|outofrange|>    2.82
↵                 2.82
<|outofrange|>    2.82
↵                 2.82
↵ ³³³             2.82
Activations Density: 0.220%