Explanations
terms related to emotional or physical suffering
Neuron Alignment
Index   Value   % of L₁
125     +0.17   1.0%
334     +0.14   0.8%
474     +0.14   0.8%
Correlated Neurons
Index   P. Corr.   Cos Sim.
125     +0.17      0.02
474     +0.14      0.03
334     +0.14      0.02
Negative Logits
Ń       -3.28
ĥ½      -3.08
Īĺ      -3.03
Ħ       -2.96
Ģ       -2.94
µ       -2.91
¬       -2.91
IJ      -2.89
¤       -2.87
čč      -2.77
Positive Logits
here        1.62
from        1.53
yl          1.51
lord        1.51
worse       1.43
river       1.43
ant         1.40
curvature   1.39
nowadays    1.39
=>          1.38
Activations Density 0.103%