INDEX
Explanations
expressions of enjoyment or positive experiences
Neuron Alignment
Index    Value    % of L₁
160      +0.15    0.8%
43       +0.12    0.6%
66       +0.11    0.6%
Correlated Neurons
Index    P. Corr.    Cos Sim.
231      +0.15       0.01
119      +0.12       0.01
27       +0.11       0.01
Negative Logits
ĥ½    -5.03
Īĺ    -4.84
·¸    -4.83
ī     -4.79
©     -4.73
»     -4.67
®     -4.62
¬     -4.58
½     -4.51
µ     -4.51
Positive Logits
technology          1.68
characterization    1.61
placement           1.50
parent              1.46
activity            1.42
prior               1.38
("                  1.37
technologies        1.37
choice              1.37
techniques          1.37
Activations Density 0.002%