INDEX
Explanations
words related to power dynamics and societal issues, such as disenfranchisement, recalcitrance, and oppression
Neuron Alignment
Index   Value   % of L₁
1967    +0.31   1.1%
1705    +0.16   0.6%
2034    +0.13   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1363    +0.31      0.10
1705    +0.16      0.09
474     +0.13      0.05
Negative Logits
effe     -1.83
desir    -1.64
lidl     -1.63
dispen   -1.62
erec     -1.62
ivi      -1.58
igno     -1.56
wien     -1.56
noss     -1.55
noel     -1.55
Positive Logits
,        0.75
ment     0.74
;        0.74
.        0.71
ative    0.71
and      0.71
ments    0.71
ation    0.70
ous      0.68
ly       0.68
Activations Density 0.644%