INDEX
Explanations
words associated with quality, value, and effectiveness
Neuron Alignment
Index   Value   % of L₁
320     +0.15   0.8%
6       +0.13   0.7%
200     +0.11   0.6%
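The "% of L₁" column can be read as each entry's share of the vector's total absolute weight. The sketch below shows that arithmetic under this assumption; the function name `l1_share` and the toy vector are hypothetical, and the dashboard's actual computation is not stated in the source.

```python
def l1_share(weights, top_k=3):
    """Return (index, value, percent of L1 norm) for the top_k
    largest-magnitude entries of a weight vector.

    Assumption: "% of L1" means |w_i| / sum_j |w_j|, expressed in percent.
    """
    l1 = sum(abs(w) for w in weights)
    # Sort indices by descending absolute value and keep the first top_k.
    order = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))[:top_k]
    return [(i, weights[i], 100.0 * abs(weights[i]) / l1) for i in order]


# Hypothetical toy vector, not data from the dashboard.
toy_weights = [0.5, -0.25, 0.125, 0.125]
rows = l1_share(toy_weights, top_k=2)
for index, value, pct in rows:
    print(f"{index:<7} {value:+.2f}   {pct:.1f}%")
```

Under the same reading, a value of +0.15 at 0.8% would imply an L₁ norm of roughly 19 for the full vector, which is consistent with the other two rows up to rounding.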
Correlated Neurons
Index   P. Corr.   Cos Sim.
155     +0.15      0.08
343     +0.13      0.06
56      +0.11      -0.01
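Pearson correlation and cosine similarity are different measures, which is why a row can show a positive correlation alongside a near-zero or negative cosine similarity. The following is a minimal sketch of the two standard definitions; which vectors the dashboard actually compares is not stated in the source, and the sample lists are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation: cosine similarity after mean-centering."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a],
                             [y - mean_b for y in b])


# Hypothetical values, not data from the dashboard.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]  # ys = 2 * xs + 1, so perfectly linearly related

p_corr = pearson_correlation(xs, ys)
cos_sim = cosine_similarity(xs, ys)
print(f"P. Corr. {p_corr:+.2f}   Cos Sim. {cos_sim:.2f}")
```

Because Pearson correlation subtracts the means first, it ignores constant offsets, while cosine similarity does not.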
Negative Logits
²        -2.15
·¸       -1.99
ĻĤ       -1.99
Ĥ¬       -1.92
IJ        -1.92
½        -1.73
Ĵ        -1.63
Īĺ       -1.62
riber    -1.59
openh    -1.58
Positive Logits
itself          1.82
advantage       1.65
ulous           1.55
[$              1.45
against         1.43
remains         1.43
otal            1.43
[@              1.40
constituents    1.39
consisted       1.39
Activations Density 0.887%