INDEX
Explanations
references related to social issues, public statements, and inappropriate behavior
Neuron Alignment
Index    Value    % of L₁
644      +0.09    0.2%
229      +0.08    0.2%
604      +0.08    0.2%
Correlated Neurons
Index    P. Corr.    Cos Sim.
644      +0.09       0.05
100      +0.08       0.04
1186     +0.08       0.03
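The two columns above measure different things, and a small sketch may help in reading them. It assumes "P. Corr." is a Pearson correlation between two activation series and "Cos Sim." is a cosine similarity between two direction vectors; the page does not say which quantities are compared, so the function names and inputs here are hypothetical.

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation of two equal-length series (assumed meaning of 'P. Corr.')."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances are left unnormalised; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors (assumed meaning of 'Cos Sim.')."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Pearson correlation subtracts the mean before comparing, while cosine similarity does not, so the two values can differ for the same pair of inputs.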
Negative Logits
impractica    -1.33
impra         -1.25
unwarran      -1.22
uninten       -1.18
increa        -1.18
thut          -1.17
fta           -1.17
disagre       -1.15
reluct        -1.14
ecru          -1.13
Positive Logits
unacceptable    0.60
tolerance       0.55
acts            0.55
violence        0.54
behavior        0.54
zero            0.52
ZERO            0.51
anyone          0.51
Zero            0.50
behaviors       0.50
Activations Density 0.417%
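
As a rough illustration of the density figure, the sketch below assumes that activation density is the fraction of sampled tokens on which the feature's activation exceeds a threshold. That definition, the function name, and the threshold parameter are assumptions and are not stated on this page.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of activations strictly above the threshold (assumed definition)."""
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)
```

Under this assumption, a density of 0.417% would mean the feature fires on roughly 4 of every 1,000 sampled tokens.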