INDEX
Explanations
sentences that state facts or assertions
Neuron Alignment
Index   Value   % of L₁
23      +0.17   1.0%
478     +0.14   0.8%
212     +0.12   0.7%
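
The source does not define the "% of L₁" column in the Neuron Alignment listing above. One plausible reading, stated here as an assumption, is that each entry's absolute value is divided by the L₁ norm (sum of absolute values) of the whole feature direction. Under that reading, the listed values would imply an L₁ norm of roughly 17. A minimal sketch follows; the function name `top_aligned_neurons` and the toy vector are hypothetical and are not taken from the dashboard.

```python
def top_aligned_neurons(direction, k=3):
    """Return (index, value, share_of_l1) for the k largest-magnitude entries.

    Assumption: "% of L1" means |value| / sum(|all values|).
    """
    l1 = sum(abs(v) for v in direction)
    if l1 == 0:
        return []  # an all-zero direction has no meaningful shares
    # Rank indices by absolute value, largest first, and keep the top k.
    order = sorted(range(len(direction)),
                   key=lambda i: abs(direction[i]),
                   reverse=True)[:k]
    return [(i, direction[i], abs(direction[i]) / l1) for i in order]


# Toy direction with made-up numbers (not the feature shown above).
toy = [0.5, -0.25, 0.125, 0.125]
rows = top_aligned_neurons(toy, k=2)
for idx, val, share in rows:
    print(f"{idx:<7} {val:+.2f}   {share:.1%}")
```

With a real feature, `direction` would be the feature's decoder (or encoder) vector expressed in the neuron basis, which is again an assumption about how the dashboard derives this listing.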
Correlated Neurons
Index   P. Corr.   Cos Sim.
23      +0.17      0.06
212     +0.14      0.05
304     +0.12      0.04
Negative Logits
ĭ       -4.70
ģ       -4.69
Ļª      -4.68
»¿      -4.47
Īĺ      -4.44
į       -4.36
Ĥ¬      -4.33
İ       -4.26
Ģ       -4.23
ī       -4.23
Positive Logits
yours        2.03
hers         2.00
imperative   1.82
ours         1.79
our          1.69
APTER        1.62
my           1.58
an           1.55
Cookie       1.41
doubly       1.39
Activations Density 0.302%