INDEX
Explanations
words associated with gender, style, and fashion
Neuron Alignment
Index   Value   % of L₁
897     +0.12   0.4%
321     +0.12   0.4%
849     +0.09   0.3%
Correlated Neurons
Index   P. Corr.   Cos Sim.
981     +0.12      0.08
321     +0.12      0.07
1597    +0.09      0.04
Negative Logits
År        -0.86
saar      -0.85
Sén       -0.84
franz     -0.83
gallina   -0.83
kram      -0.83
meis      -0.83
pank      -0.82
incess    -0.81
palme     -0.81
Positive Logits
AndEndTag       0.58
XtraReports     0.56
ruly            0.54
relenting       0.54
autorytatywna   0.53
mistak          0.52
pinulongan      0.52
thinkable       0.52
المناصب         0.51
sightly         0.51
Activations Density: 0.531%