INDEX
Explanations
pronouns related to gender
Neuron Alignment
Index    Value    % of L₁
397      +0.11    0.4%
1637     +0.11    0.3%
2011     +0.10    0.3%
Correlated Neurons
Index    P. Corr.    Cos Sim.
131      +0.11       0.07
397      +0.11       0.07
1637     +0.10       0.08
Negative Logits
suspic       -1.23
thut         -1.23
gend         -1.22
tew          -1.18
seiz         -1.17
fta          -1.17
sii          -1.17
aen          -1.15
desir        -1.11
stockholm    -1.11
Positive Logits
himself    1.09
His        1.09
his        1.03
his        1.00
himself    1.00
His        0.99
He         0.96
Himself    0.93
He         0.92
he         0.88
Activations Density: 0.558%