Explanations
"support" or "backing" associated with policies or decisions
Neuron Alignment
Index   Value   % of L₁
50      +0.23   0.9%
605     +0.10   0.4%
573     +0.09   0.3%
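The "% of L₁" column can plausibly be read as each neuron's share of the feature vector's L₁ norm, that is, the absolute value of one entry divided by the sum of absolute values of all entries. The page does not state this definition, so the sketch below is an assumption. The function name `l1_share` and the sample weights are hypothetical and are not taken from this feature.

```python
def l1_share(weights, index):
    """Return |weights[index]| as a fraction of the vector's L1 norm.

    Assumed reading of the "% of L1" column; the page does not define it.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("L1 norm is zero; the share is undefined")
    return abs(weights[index]) / l1_norm


# Hypothetical 4-entry weight vector (illustrative only).
weights = [0.5, -0.25, 0.125, 0.125]
share = l1_share(weights, 0)
print(f"{share:.1%}")
```

Under this reading, small percentages such as those in the table would mean that no single neuron dominates the feature's direction.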
Correlated Neurons
Index   P. Corr.   Cos Sim.
710     +0.23      0.04
889     +0.10      0.04
1892    +0.09      0.04
Negative Logits
Token         Logit
<bos>         -2.79
const         -0.76
(not shown)   -0.74
public        -0.71
did           -0.71
/*            -0.68
have          -0.68
else          -0.68
(not shown)   -0.68
has           -0.68
Positive Logits
Token         Logit
affor         2.03
impra         1.98
increa        1.93
stockholm     1.91
Juf           1.87
reluct        1.82
accla         1.80
fta           1.80
maneu         1.79
mef           1.78
Activations Density 0.718%
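
Activation density is commonly taken to be the fraction of sampled tokens on which the feature fires at all. The page does not say how its figure was computed, so the sketch below is an assumption. The function name `activation_density`, the zero threshold, and the sample activations are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumed definition; the page does not say how its density is computed.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical activations over 8 tokens (illustrative only): 2 are non-zero.
acts = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.2, 0.0]
density = activation_density(acts)
print(f"{density:.3%}")
```

On that definition, a density of 0.718% would correspond to the feature being active on roughly 7 of every 1,000 tokens.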