INDEX
Explanations
phrases related to societal issues and controversies
Neuron Alignment
Index   Value   % of L₁
1535    +0.23   0.8%
2034    +0.21   0.7%
1699    +0.17   0.6%
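The "% of L₁" column can be read as each neuron's share of the feature direction's total absolute weight. The sketch below illustrates that reading on a toy vector. It assumes "Value" is the component of the feature's direction on a given neuron and that rows are ranked by signed value; the function name `neuron_alignment` and the toy numbers are hypothetical and are not taken from this dashboard.

```python
def neuron_alignment(direction, k=3):
    """Return the k largest components as (index, value, share_of_l1).

    Assumption: share_of_l1 = |value| / sum of |component| over the
    whole direction, which is one plausible reading of "% of L1".
    """
    l1 = sum(abs(w) for w in direction)
    # Rank by signed value, largest first (an assumption about the ranking).
    top = sorted(range(len(direction)), key=lambda i: direction[i], reverse=True)[:k]
    return [(i, direction[i], abs(direction[i]) / l1) for i in top]


# Toy direction with four "neurons"; these are not the real feature's weights.
toy_direction = [0.5, -1.0, 0.3, 0.2]
rows = neuron_alignment(toy_direction, k=2)

print("Index  Value  % of L1")
for idx, value, share in rows:
    print(f"{idx:>5}  {value:+.2f}  {share:.1%}")
```

Under this reading, a value of +0.23 at 0.8% would imply a total L₁ norm of roughly 29 for the direction, which fits the other two rows once rounding is allowed for.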
Correlated Neurons
Index   P. Corr.   Cos Sim.
382     +0.23      0.15
1535    +0.21      0.12
610     +0.17      0.10
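The two columns above measure related but different things. As a minimal sketch, assuming "P. Corr." is the Pearson correlation and "Cos Sim." the cosine similarity between the feature's activations and a neuron's activations over the same tokens, both can be computed as follows. The helper names and the toy activations are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cosine(x, y):
    """Cosine similarity of the raw (uncentered) vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Toy activations over four tokens; the neuron equals the feature plus a constant.
feature_acts = [0.0, 0.0, 1.0, 2.0]
neuron_acts = [1.0, 1.0, 2.0, 3.0]

print(f"P. Corr.  {pearson(feature_acts, neuron_acts):+.2f}")
print(f"Cos Sim.  {cosine(feature_acts, neuron_acts):.2f}")
```

Because Pearson correlation subtracts the means first, a constant offset in the neuron's activations leaves it unchanged but does change the cosine similarity, which is one reason the two columns can disagree.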
Negative Logits
Token       Logit
Lma         -0.88
Darío       -0.88
Lmfao       -0.86
viciss      -0.84
Darum       -0.80
suspic      -0.79
churrasco   -0.78
Hahah       -0.76
repug       -0.74
doctr       -0.74
Positive Logits
Token       Logit
These       0.66
↵↵          0.64
(blank)     0.63
Such        0.63
This        0.63
They        0.59
%).         0.59
Resultat    0.59
:</         0.59
),),        0.58
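Logit lists like the two above are commonly produced by projecting the feature's direction through the model's unembedding and keeping the most negative and most positive tokens. The sketch below shows that projection on a toy vocabulary. It assumes this is how the lists were generated, which the page itself does not state. The function name `logit_effects`, the tokens, and all numbers are hypothetical.

```python
def logit_effects(direction, unembed, vocab, k=2):
    """Score every token by the dot product of its unembedding vector
    with the feature direction, then return the k lowest and k highest."""
    scores = [sum(d * u for d, u in zip(direction, row)) for row in unembed]
    order = sorted(range(len(vocab)), key=lambda j: scores[j])  # ascending
    negative = [(vocab[j], scores[j]) for j in order[:k]]
    positive = [(vocab[j], scores[j]) for j in reversed(order[-k:])]
    return negative, positive


# Toy setup: a 2-dimensional direction and a 4-token vocabulary.
toy_direction = [1.0, -1.0]
toy_vocab = ["a", "b", "c", "d"]
toy_unembed = [
    [0.5, -0.25],   # token "a"
    [0.5, 0.0],     # token "b"
    [-0.5, 0.5],    # token "c"
    [-0.25, 0.5],   # token "d"
]

negative, positive = logit_effects(toy_direction, toy_unembed, toy_vocab, k=2)
print("Negative:", negative)
print("Positive:", positive)
```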
Activations Density: 0.793%
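Activation density is usually the fraction of tokens on which the feature is active at all. A minimal sketch, assuming "active" means an activation strictly above zero (the threshold is an assumption, and the sample data is made up):

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation is strictly above the threshold."""
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy sample: 1,000 tokens, of which 8 have a nonzero activation.
toy_acts = [0.0] * 992 + [0.4] * 8
density = activation_density(toy_acts)
print(f"Activations Density: {density:.3%}")
```

On that reading, a density of 0.793% corresponds to the feature firing on about 8 tokens in every 1,000.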