INDEX
Explanations
phrases related to government, policy, and accountability
Neuron Alignment
Index   Value   % of L₁
50      +0.20   0.9%
604     +0.15   0.7%
1499    +0.11   0.5%
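The page does not say how the Neuron Alignment figures are computed. Here is a minimal sketch, assuming each row is one entry of the feature's decoder vector (one weight per basis dimension, labelled a "neuron" on the dashboard) and "% of L₁" is that entry's magnitude divided by the vector's L1 norm. Under that reading, +0.20 at 0.9% implies an L1 norm of roughly 22. The helper name `top_aligned_neurons` and the ranking by magnitude are illustrative assumptions.

```python
def top_aligned_neurons(decoder_vec, k=3):
    """Return (index, value, percent_of_l1) for the k largest-magnitude entries.

    Hypothetical helper: assumes the dashboard ranks decoder weights by
    magnitude and reports each as a share of the vector's L1 norm.
    """
    l1 = sum(abs(w) for w in decoder_vec)
    # Sort indices by |weight|, largest first.
    order = sorted(range(len(decoder_vec)), key=lambda i: abs(decoder_vec[i]), reverse=True)
    return [(i, decoder_vec[i], 100.0 * abs(decoder_vec[i]) / l1) for i in order[:k]]


# Toy 3-dimensional decoder vector; its L1 norm is 0.5 + 0.2 + 0.3 = 1.0.
for idx, val, pct in top_aligned_neurons([0.5, -0.2, 0.3], k=2):
    print(idx, f"{val:+.2f}", f"{pct:.1f}%")
```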
Correlated Neurons
Index   Pearson Corr.   Cosine Sim.
1499    +0.20           0.16
344     +0.15           -0.01
1048    +0.11           0.04
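The Correlated Neurons table reports two measures, Pearson correlation and cosine similarity, and they can disagree: neuron 344 scores +0.15 on the first but -0.01 on the second. The page does not state which vectors each measure is computed over, so the sketch below only shows the two formulas on plain Python lists. The toy `feature` and `neuron` series are hypothetical. Pearson correlation centers both inputs before normalizing and cosine similarity does not, which is one way the two can diverge.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance of the centered series over the product of their spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; no centering is applied."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy activations: the "neuron" equals the "feature" plus a constant offset.
feature = [0.0, 0.0, 1.0, 0.0]
neuron = [1.0, 1.0, 2.0, 1.0]
print(round(pearson(feature, neuron), 3))            # 1.0
print(round(cosine_similarity(feature, neuron), 3))  # 0.756
```

The constant offset leaves the Pearson correlation at 1.0 because centering removes it, while the cosine similarity drops to about 0.756.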
Negative Logits
Token           Logit
<bos>           -3.34
anță            -0.99
realize         -0.83
AVOR            -0.83
ⓧ               -0.82
utilize         -0.82
ății            -0.80
color           -0.80
personalize     -0.79
neighborhoods   -0.79
Positive Logits
Token           Logit
soulign         1.43
McLaugh         1.40
Juf             1.36
tucson          1.29
véhic           1.28
unlaw           1.26
Bartholo        1.26
Gorb            1.26
impractica      1.25
Rine            1.21
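Negative and positive logit lists like these are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and reading off the lowest- and highest-scoring vocabulary tokens. The page does not confirm that the values above were computed this way. The sketch below assumes a plain dot product with no final layer norm. The names `logit_effects` and `unembed` and the toy vocabulary are hypothetical.

```python
def logit_effects(decoder_vec, unembed, vocab, k=3):
    """Rank tokens by the dot product of a decoder direction with each unembedding column.

    Assumption: unembed[d][t] is the weight from model dimension d to token t
    (a d_model x vocab_size matrix stored as nested lists); no layer norm is applied.
    Returns (most_negative, most_positive), each a list of (token, score) pairs.
    """
    scores = [
        sum(decoder_vec[d] * unembed[d][t] for d in range(len(decoder_vec)))
        for t in range(len(vocab))
    ]
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    return ranked[:k], ranked[::-1][:k]


# Toy example: 2 model dimensions, 4-token vocabulary.
decoder = [1.0, -1.0]
unembed = [
    [2.0, 0.0, 1.0, -1.0],
    [0.0, 2.0, 1.0, 0.5],
]
negative, positive = logit_effects(decoder, unembed, ["a", "b", "c", "d"], k=2)
print("negative:", negative)  # [('b', -2.0), ('d', -1.5)]
print("positive:", positive)  # [('a', 2.0), ('c', 0.0)]
```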
Activation Density: 3.866%
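Activation density is usually the share of tokens in a sample on which the feature fires, so 3.866% would mean roughly 39 tokens in every 1,000. Here is a minimal sketch, assuming "fires" means an activation strictly above zero. The threshold and the function name `activation_density` are assumptions, not taken from the page.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose activation exceeds `threshold` (assumed 0)."""
    fired = sum(1 for a in activations if a > threshold)
    return 100.0 * fired / len(activations)


# Toy sample of 8 token activations, 2 of which are nonzero.
print(activation_density([0.0, 0.0, 1.2, 0.0, 0.3, 0.0, 0.0, 0.0]))  # 25.0
```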