Explanations
terms related to gender, specifically the mentions of male and female
Neuron Alignment
Index    Value    % of L₁
156      +0.19    1.1%
129      +0.14    0.8%
71       +0.12    0.6%
Correlated Neurons
Index    Pearson Corr.    Cos Sim.
318      +0.19            0.01
134      +0.14            0.02
269      +0.12            0.02
Negative Logits
Token         Logit
applicable    -1.74
happening     -1.41
ters          -1.39
inson         -1.37
olen          -1.35
metast        -1.30
áz            -1.29
tighter       -1.29
stolen        -1.27
subseteq      -1.26
Positive Logits
Token    Logit
¢        2.19
¬        2.12
ļ        2.12
¾        1.98
į        1.96
ĻĤ       1.96
ī        1.90
ģ        1.88
ľ        1.82
ĩ        1.81
Activation Density: 0.091%