INDEX
Explanations
references to political issues and officials
Neuron Alignment
Index   Value   % of L₁
2034    +0.23   0.8%
382     +0.17   0.6%
752     +0.14   0.5%
Correlated Neurons
Index   Pearson Corr.   Cosine Sim.
382     +0.23           0.08
878     +0.17           0.06
1959    +0.14           0.06
Negative Logits
disagre   -2.30
increa    -2.30
affor     -2.29
impra     -2.29
inev      -2.23
fta       -2.21
encomp    -2.20
reluct    -2.17
maneu     -2.17
squa      -2.16
Positive Logits
We      1.07
It      1.06
").     1.04
They    1.03
There   1.00
”.      0.99
".      0.99
You     0.99
That    0.98
But     0.98
Activations Density: 0.217%