Explanations
phrases related to surveillance and control
Neuron Alignment
Index   Value   % of L₁
674     +0.10   0.3%
1042    +0.10   0.3%
752     +0.09   0.3%
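The "% of L₁" column can be read as each entry's share of the L₁ norm of the feature's direction in neuron space. The sketch below assumes that reading, since the dashboard does not state its formula. It takes a hypothetical decoder vector, ranks entries by absolute value (an assumed ranking rule), and reports each entry's |value| as a percentage of the whole vector's L₁ norm. The name `neuron_alignment` is illustrative only.

```python
def neuron_alignment(decoder_vec, k=3):
    """Hypothetical helper: top-k entries of a feature direction by magnitude.

    Returns (index, value, percent_of_l1) tuples, where percent_of_l1 is
    |value| divided by the L1 norm of the whole vector, times 100.
    """
    l1 = sum(abs(v) for v in decoder_vec)
    if l1 == 0:
        return []  # an all-zero direction has no meaningful alignment
    # Rank indices by absolute value, largest first (assumed ranking rule).
    order = sorted(range(len(decoder_vec)), key=lambda i: abs(decoder_vec[i]), reverse=True)
    return [(i, decoder_vec[i], 100.0 * abs(decoder_vec[i]) / l1) for i in order[:k]]


print(neuron_alignment([0.5, -0.25, 0.125, 0.125], k=2))
# [(0, 0.5, 50.0), (1, -0.25, 25.0)]
```

Under this reading, a largest share of only 0.3% implies every entry holds at most 0.3% of the L₁ norm, so the direction is spread over hundreds of neurons rather than aligned with any single one.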
Correlated Neurons
Index   Pearson Corr.   Cosine Sim.
752     +0.10           0.05
1533    +0.10           0.01
2016    +0.09           0.06
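The Correlated Neurons table reports two different similarity measures. A minimal sketch of both follows, assuming the Pearson correlation is computed over paired activation values (feature vs. neuron, one pair per token) and the cosine similarity over two weight-space vectors; the dashboard does not say which inputs it uses, and `pearson` and `cosine` are illustrative names. The key difference is that Pearson correlation subtracts each series' mean first, while cosine similarity does not.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (mean-centred)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # undefined for a constant series; 0.0 is an arbitrary convention here
    return cov / (sx * sy)


def cosine(u, v):
    """Cosine similarity of two vectors (no mean-centring)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# b is a reversed copy of a shifted up by 8: perfectly anti-correlated,
# yet the raw vectors still point in broadly similar directions.
a = [1.0, 2.0, 3.0]
b = [11.0, 10.0, 9.0]
print(pearson(a, b), cosine(a, b))  # pearson is about -1.0, cosine about +0.89
```

Because the two measures can disagree this strongly, a neuron can show a small correlation and a small cosine similarity for unrelated reasons.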
Negative Logits
reluct      -1.52
guarante    -1.43
encomp      -1.42
impractica  -1.42
unlaw       -1.41
shenan      -1.41
affor       -1.40
indor       -1.39
increa      -1.39
resear      -1.38
Positive Logits
his     1.10
their   1.06
<bos>   1.01
their   0.95
his     0.94
its     0.93
her     0.90
your    0.83
suas    0.83
seu     0.82
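Positive and negative logits of this kind are commonly obtained by projecting a feature's output direction onto the model's unembedding matrix and reading off the highest- and lowest-scoring vocabulary entries. The sketch below assumes that procedure; `logit_effects` and the toy two-dimensional vocabulary are hypothetical. Vocabulary entries are kept as (token, vector) pairs rather than a dict because distinct tokens can render identically, which is presumably why `his` and `their` each appear twice in the Positive Logits list (for example, with and without a leading space).

```python
def logit_effects(direction, unembed_rows, k=3):
    """Hypothetical helper: score every vocabulary entry by its dot product
    with a feature direction and return the top-k and bottom-k entries.

    unembed_rows is a list of (token_string, unembedding_vector) pairs.
    """
    scored = [
        (tok, sum(d * w for d, w in zip(direction, vec)))
        for tok, vec in unembed_rows
    ]
    scored.sort(key=lambda pair: pair[1])  # ascending by score
    negative = scored[:k]                  # most suppressed tokens
    positive = scored[::-1][:k]            # most promoted tokens
    return positive, negative


vocab = [
    ("his", [1.1, 0.0]),
    ("reluct", [-1.5, 0.2]),
    ("their", [1.0, 0.3]),
    ("the", [0.1, 0.9]),
]
pos, neg = logit_effects([1.0, 0.0], vocab, k=2)
print([t for t, _ in pos], [t for t, _ in neg])  # ['his', 'their'] ['reluct', 'the']
```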
Activation Density: 0.668%
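Activation density is usually the percentage of tokens on which the feature fires. The sketch assumes "fires" means an activation strictly above zero; the threshold is left as a parameter because the dashboard does not state one, and `activation_density` is a hypothetical helper. On that reading, 0.668% means the feature is active on roughly one token in 150 (1 / 0.00668 ≈ 150).

```python
def activation_density(activations, threshold=0.0):
    """Hypothetical helper: percentage of tokens whose activation exceeds threshold."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


acts = [0.0, 0.0, 0.5, 0.0, 1.2, 0.0, 0.0, 0.0]
print(activation_density(acts))  # 25.0 (2 of 8 tokens active)
```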