INDEX
Explanations
statements highlighting societal or political issues
Neuron Alignment
Index   Value   % of L₁
1741    +0.17   0.6%
2019    +0.11   0.4%
1967    +0.11   0.4%
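
One plausible reading of the "% of L₁" column is each neuron's share of the L1 norm of the feature's weight vector, that is, |value| divided by the sum of absolute values over all neurons. The page does not define the column, so this is an assumption. Here is a minimal sketch under that reading; the function name `l1_share` and the toy vector are hypothetical and are not data from this page.

```python
def l1_share(weights, top_k=3):
    """Return (index, value, share_of_l1) for the top_k entries by |value|.

    Assumes "% of L1" means |w_i| / sum_j |w_j|; this is an interpretation,
    not something the dashboard states.
    """
    l1 = sum(abs(w) for w in weights)
    # Indices sorted by descending absolute weight (stable for ties).
    order = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))[:top_k]
    return [(i, weights[i], abs(weights[i]) / l1) for i in order]


# Hypothetical toy vector for illustration only.
toy = [0.5, -0.25, 0.125, 0.125]
rows = l1_share(toy, top_k=2)
for idx, val, share in rows:
    print(f"{idx}\t{val:+.2f}\t{share:.1%}")
```

Under this reading, a value of +0.17 at 0.6% would imply a total L1 norm of roughly 28, which is in line with the +0.11 rows at 0.4%.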
Correlated Neurons
Index   P. Corr.   Cos Sim.
781     +0.17      0.04
468     +0.11      0.04
47      +0.11      0.04
Negative Logits
maksi        -1.02
Abbé         -0.98
tramonto     -0.92
Shakspeare   -0.91
Cæsar        -0.87
Eccle        -0.87
Mémoires     -0.86
saar         -0.86
alip         -0.86
strick       -0.85
Positive Logits
same      0.88
reason    0.78
biggest   0.73
easiest   0.72
exact     0.70
same      0.69
result    0.69
largest   0.66
best      0.66
way       0.66
Activations Density: 0.230%