Explanations
phrases that indicate involvement or participation in various contexts
Neuron Alignment
Index   Value   % of L₁
282     +0.16   0.9%
186     +0.14   0.8%
485     +0.13   0.7%
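A minimal sketch of how a "% of L₁" column could be derived, assuming it means each neuron's absolute alignment value divided by the L₁ norm (the sum of absolute values) of the full alignment vector. The page does not state this definition, so treat it as an assumption. The helper name `percent_of_l1` and the toy vector are hypothetical and are not taken from this feature's data.

```python
def percent_of_l1(values):
    """Return each entry's share of the vector's L1 norm, in percent.

    Assumption: "% of L1" = 100 * |value| / sum(|values|).
    """
    l1_norm = sum(abs(v) for v in values)
    if l1_norm == 0:
        raise ValueError("L1 norm is zero; shares are undefined")
    return [100.0 * abs(v) / l1_norm for v in values]


# Hypothetical toy alignment vector (illustration only).
toy_alignment = [0.5, -0.25, 0.25]
shares = percent_of_l1(toy_alignment)
```

Under this reading, the small percentages in the table would indicate that no single neuron accounts for much of the feature's total alignment mass.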
Correlated Neurons
Index   P. Corr.   Cos Sim.
282     +0.16      0.08
69      +0.14      0.08
440     +0.13      0.05
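The two similarity columns can disagree, so a short sketch may help. It assumes "P. Corr." is the Pearson correlation and "Cos Sim." is the cosine similarity; which vectors the dashboard actually compares is not stated here, so the inputs below are hypothetical toy data. Pearson correlation is computed as the cosine similarity of the mean-centered vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation: cosine similarity after mean-centering."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)


# Hypothetical toy vectors (illustration only).
u = [1.0, 2.0, 3.0]
v = [3.0, 2.0, 1.0]
cos_uv = cosine_similarity(u, v)
corr_uv = pearson_correlation(u, v)
```

Here the cosine similarity is positive because both vectors have positive entries, while the correlation is negative because one rises as the other falls. This is why the two columns need not track each other.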
Negative Logits
heads      -1.91
head       -1.76
ogg        -1.66
stown      -1.66
weights    -1.61
setminus   -1.56
again      -1.51
ctin       -1.49
isters     -1.49
matter     -1.45
Positive Logits
»¿        2.64
ĥ½        2.31
¿         2.26
↵         2.23
↵         2.23
↵↵        2.23
↵         2.23
(blank)   2.23
↵         2.23
(blank)   2.23
Activations Density 0.542%
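A minimal sketch of one common reading of activation density, assuming it is the fraction of sampled tokens on which the feature's activation exceeds a threshold. The page does not define the statistic, so this is an assumption. The helper name `activation_density`, the zero threshold, and the toy activations are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation is strictly above `threshold`.

    Assumption: density = (# active tokens) / (# sampled tokens).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy activations over four tokens (illustration only).
toy_activations = [0.0, 0.0, 1.5, 0.0]
density = activation_density(toy_activations)
```

Under this reading, a density of 0.542% would mean the feature fires on roughly one token in every 185 sampled.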