INDEX
Explanations
phrases related to politics and power dynamics
Neuron Alignment
Index   Value   % of L₁
1438    +0.19   0.6%
1842    +0.16   0.5%
1150    +0.13   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1438    +0.19      0.10
284     +0.16      0.10
332     +0.13      0.07
Negative Logits
unwarran   -1.72
reluct     -1.60
inev       -1.60
disagre    -1.58
volunte    -1.51
increa     -1.51
affor      -1.47
desir      -1.46
uninten    -1.45
excru      -1.44
Positive Logits
.      0.91
.”     0.72
↵↵     0.72
.~     0.71
."     0.70
。     0.70
).     0.69
.      0.69
↵↵↵    0.69
!      0.68
Activations Density 0.765%