Explanations
gender-related words and phrases, including concepts of male dominance, female submission, and gender roles
Neuron Alignment
Index   Value   % of L₁
1842    +0.12   0.3%
964     +0.10   0.3%
198     +0.09   0.3%
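The "% of L₁" column can be read as each neuron's share of the weight vector's L₁ norm. The sketch below illustrates that reading only; the function name `l1_share` and the toy weights are hypothetical, and the dashboard's exact computation is an assumption, not something stated on the page.

```python
def l1_share(weights, index):
    """Return |weights[index]| as a percentage of the vector's L1 norm.

    Assumption: "% of L1" means one entry's absolute value divided by the
    sum of absolute values over the whole vector.
    """
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("weight vector has zero L1 norm")
    return 100.0 * abs(weights[index]) / l1_norm


# Hypothetical toy vector, not taken from the dashboard.
toy_weights = [0.12, -0.10, 0.09, 0.05, -0.04]
share = l1_share(toy_weights, 0)
print(f"{share:.1f}% of L1")
```

In a real feature the vector has thousands of entries, which is why even the top-aligned neurons above each account for only about 0.3% of the total.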
Correlated Neurons
Index   P. Corr.   Cos Sim.
1948    +0.12      0.05
1608    +0.10      0.02
1366    +0.09      0.03
Negative Logits
makro   -0.78
kaos    -0.73
aton    -0.69
kram    -0.67
lele    -0.66
fortn   -0.66
teras   -0.65
saba    -0.65
usta    -0.65
saar    -0.65
Positive Logits
husbands    0.56
women       0.54
توضیحات     0.52
husband     0.52
wives       0.51
Mulher      0.50
homemaker   0.50
herself     0.50
obiety      0.49
kaufs       0.48
Activations Density 0.413%
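
Activation density is commonly understood as the fraction of sampled tokens on which the feature fires. The sketch below shows that reading under stated assumptions: the function name `activation_density`, the zero threshold, and the toy activations are all hypothetical, and the dashboard's own sampling procedure is not given on the page.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: a token counts as "active" when its activation is strictly
    greater than the threshold (zero by default).
    """
    if not activations:
        raise ValueError("no activations supplied")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 1000 tokens, 5 of which activate the feature.
toy_activations = [0.0] * 995 + [1.3, 0.2, 0.7, 2.1, 0.05]
density = activation_density(toy_activations)
print(f"Activation density: {density:.3f}%")
```

Under this reading, a density of 0.413% would mean the feature is active on roughly 4 of every 1000 sampled tokens.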