Explanations
questions and phrases related to the concept of "how."
Neuron Alignment
Index   Value   % of L₁
70      +0.14   0.8%
23      +0.12   0.6%
51      +0.11   0.6%
Correlated Neurons
Index   Pearson Corr.   Cosine Sim.
70      +0.14           0.07
496     +0.12           0.06
504     +0.11           0.06
Negative Logits
hesis      -1.68
whose      -1.67
achus      -1.61
burg       -1.58
ylvania    -1.57
(\#        -1.55
(“         -1.52
whom       -1.51
aho        -1.44
footnote   -1.42
Positive Logits
ĥ½               2.52
¾                2.35
↵↵               2.24
<|outofrange|>   2.24
č↵               2.24
↵                2.24
<|outofrange|>   2.24
↵                2.24
                 2.24
<|outofrange|>   2.24
Activations Density 0.108%