Explanations
statements related to societal issues, morality, politics, and ethical behavior
Neuron Alignment
Index   Value   % of L₁
2034    +0.14   0.4%
872     +0.11   0.3%
1741    +0.11   0.3%
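
The Value and % of L₁ columns can be related with a short sketch. It assumes that "% of L₁" means a neuron's absolute weight divided by the sum of absolute weights (the L₁ norm) of the whole feature vector. The page itself does not define this. The function names and the toy vector are hypothetical and are not the dashboard's data.

```python
def l1_shares(weights):
    """Return each entry's absolute value as a fraction of the vector's L1 norm."""
    l1 = sum(abs(w) for w in weights)
    if l1 == 0:
        raise ValueError("a zero vector has no L1 shares")
    return [abs(w) / l1 for w in weights]


def top_aligned(weights, k=3):
    """Return (index, value, share of L1) for the k largest-magnitude entries."""
    shares = l1_shares(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return [(i, weights[i], shares[i]) for i in order[:k]]


# Toy vector (assumption): values chosen so the L1 norm is exactly 1.0.
toy = [0.5, -0.25, 0.125, 0.125]

for index, value, share in top_aligned(toy, k=2):
    print(f"{index}\t{value:+.2f}\t{share:.1%}")
```

Under this reading, a value of +0.14 at 0.4% would imply an L₁ norm of roughly 35 for the full vector, which is consistent with the rounded values in the other two rows.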
Correlated Neurons
Index   P. Corr.   Cos Sim.
382     +0.14      0.07
1959    +0.11      0.08
310     +0.11      0.05
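
For reference, the two column metrics can be computed as follows. This is a minimal sketch that assumes "P. Corr." is the Pearson correlation and "Cos Sim." the cosine similarity. Which vectors the dashboard feeds into each metric (activations, weights, or something else) is not stated on the page, so the inputs here are toy data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy data (assumption): b is a shifted by a constant.
a = [1.0, 2.0, 3.0]
b = [11.0, 12.0, 13.0]

print(pearson(a, b))  # mean-centred, so the constant shift is ignored
print(cosine(a, b))   # not mean-centred, so the shift lowers the similarity
```

The two metrics differ only in mean-centring, which is why a pair of vectors can score differently in the two columns.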
Negative Logits
solidar   -0.82
demen     -0.81
quí       -0.81
promi     -0.80
notor     -0.80
umo       -0.80
melat     -0.79
dises     -0.79
robus     -0.76
albic     -0.76
Positive Logits
Therefore   0.73
therefore   0.71
Therefore   0.65
therefore   0.64
Whether     0.61
Hence       0.59
whether     0.58
hence       0.55
unless      0.52
But         0.52
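
A sketch of how per-token logit lists like the two above are commonly produced, assuming each value is the dot product of the feature's output direction with that token's unembedding vector. The page does not confirm this method. The direction, the two-dimensional vectors, and the function names are made up for illustration; only the token strings are borrowed from the lists.

```python
def logit_effects(direction, unembed):
    """Dot the feature direction with each token's unembedding vector."""
    return {
        token: sum(d * w for d, w in zip(direction, vector))
        for token, vector in unembed.items()
    }


def top_tokens(effects, k, positive=True):
    """Return the k tokens with the largest (or most negative) effect."""
    ranked = sorted(effects.items(), key=lambda item: item[1], reverse=positive)
    return ranked[:k]


# Toy direction and unembedding vectors (assumptions, not real model weights).
direction = [1.0, -0.5]
unembed = {
    "Therefore": [0.5, -0.5],
    "Hence": [0.5, 0.0],
    "solidar": [-0.5, 0.5],
    "umo": [-0.25, 0.5],
}

effects = logit_effects(direction, unembed)
print(top_tokens(effects, 2, positive=True))
print(top_tokens(effects, 2, positive=False))
```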
Activations Density: 0.662%
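
A density figure like this can be read as the fraction of sampled tokens on which the feature fires. The sketch below assumes "fires" means an activation strictly above a threshold of zero. Both the threshold and the sample are assumptions, and the function name is hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds the threshold."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy sample (assumption): 2 of 8 tokens are active.
sample = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.25, 0.0]

print(f"{activation_density(sample):.3%}")
```

On that reading, 0.662% would correspond to roughly 66 active tokens per 10,000 sampled.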