INDEX
Explanations
mentions of gender, particularly focusing on males and their descriptions
Neuron Alignment
Index   Value   % of L₁
554     +0.13   0.4%
390     +0.12   0.4%
168     +0.12   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
1516    +0.13      0.03
554     +0.12      0.03
765     +0.12      0.03
Negative Logits
PhysRevLett        -0.49
lanka              -0.49
karna              -0.48
ntos               -0.47
ConverterFactory   -0.47
kuli               -0.47
resizingMask       -0.46
іга                -0.46
kuf                -0.45
adipis             -0.45
Positive Logits
male               1.18
Male               1.09
Male               1.08
male               1.03
MALE               1.01
males              0.97
Males              0.92
Males              0.89
témoignage         0.87
actionTypes        0.84
Activations Density: 0.052%