INDEX
Explanations
mentions of the term "race"
Neuron Alignment
Index   Value   % of L₁
521     +0.18   0.7%
555     +0.17   0.7%
1407    +0.15   0.6%
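The "% of L₁" column is consistent with dividing each value by the L₁ norm of the feature's decoder vector: 0.18 at 0.7% and 0.15 at 0.6% both imply an L₁ norm of roughly 25. This is a minimal sketch under that assumption. It also assumes that the listed neurons are the largest-magnitude components of the decoder vector. `neuron_alignment` is a hypothetical helper, not a dashboard API.

```python
def neuron_alignment(decoder_vec, k=3):
    """Hypothetical helper: top-k neuron components of a decoder vector.

    Assumes decoder_vec holds the feature's decoder weights indexed by neuron,
    and that "% of L1" means |component| / L1 norm of the whole vector.
    """
    l1 = sum(abs(v) for v in decoder_vec)
    if l1 == 0:
        return []
    # Rank neuron indices by absolute component size, largest first (stable sort).
    ranked = sorted(range(len(decoder_vec)),
                    key=lambda i: abs(decoder_vec[i]), reverse=True)
    # Each row: (neuron index, signed value, percentage of the L1 norm).
    return [(i, decoder_vec[i], 100.0 * abs(decoder_vec[i]) / l1)
            for i in ranked[:k]]


print(neuron_alignment([0.25, -0.5, 0.125, 0.125]))
# → [(1, -0.5, 50.0), (0, 0.25, 25.0), (2, 0.125, 12.5)]
```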
Correlated Neurons
Index   Pearson Corr.   Cosine Sim.
521     +0.18           0.04
555     +0.17           0.03
1407    +0.15           0.03
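Pearson correlation subtracts each vector's mean before comparing, and cosine similarity does not. This is a minimal plain-Python sketch of both measures. The page does not say which vectors are compared. One possibility, assumed here, is the activations of the feature and of each neuron over the same tokens. `pearson` and `cosine` are illustrative helpers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    # Undefined when either input is constant; report 0.0 in that case.
    return cov / (sx * sy) if sx and sy else 0.0


def cosine(xs, ys):
    """Cosine similarity: dot product over the product of L2 norms (no centering)."""
    dot = sum(x * y for x, y in zip(xs, ys))
    nx = math.sqrt(sum(x * x for x in xs))
    ny = math.sqrt(sum(y * y for y in ys))
    # Undefined for a zero vector; report 0.0 in that case.
    return dot / (nx * ny) if nx and ny else 0.0
```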
Negative Logits
々木       -0.50
pí         -0.49
piatta     -0.49
Horário    -0.47
didat      -0.46
svolge     -0.46
Jennifer   -0.45
plein      -0.45
vostre     -0.45
quido      -0.44
Positive Logits
race       1.34
Race       1.26
RACE       1.24
Race       1.20
races      1.12
hairc      1.10
Races      1.09
race       1.09
ecru       1.09
gaily      1.07
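The positive and negative logit lists appear to rank vocabulary tokens by how strongly the feature raises or lowers their logits. A common way to build such lists, assumed here and not confirmed by the page, is to project the feature's decoder direction through the unembedding matrix W_U and keep the top and bottom k tokens. This sketch ignores the final layer norm and any biases. `logit_effects` and the toy inputs are hypothetical.

```python
def logit_effects(decoder_vec, W_U, vocab, k=10):
    """Hypothetical helper: direct logit effect of a feature on each token.

    Assumes W_U is a d_model x vocab_size nested list and ignores layer norm
    and biases. Returns (top-k positive, bottom-k negative) (token, score) lists.
    """
    scores = []
    for t, token in enumerate(vocab):
        # Dot product of the decoder direction with token t's unembedding column.
        score = sum(decoder_vec[i] * W_U[i][t] for i in range(len(decoder_vec)))
        scores.append((token, score))
    ranked = sorted(scores, key=lambda pair: pair[1])
    negative = ranked[:k]          # most negative scores first
    positive = ranked[::-1][:k]    # most positive scores first
    return positive, negative


W_U = [[2.0, -1.0, 0.5],
       [9.0, 9.0, 9.0]]
pos, neg = logit_effects([1.0, 0.0], W_U, ["race", "piatta", "gaily"], k=1)
print(pos, neg)  # → [('race', 2.0)] [('piatta', -1.0)]
```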
Activation Density: 0.037%
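Activation density is the share of tokens on which the feature fires. A value of 0.037% corresponds to about 37 activating tokens per 100,000. This is a minimal sketch that assumes "fires" means an activation strictly above zero, because the dashboard's threshold is not stated. `activation_density` is a hypothetical helper.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose activation exceeds threshold (assumed 0)."""
    if not activations:
        return 0.0
    # Count tokens where the feature is active under the assumed threshold.
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


print(activation_density([0.0, 0.0, 0.5, 0.0, 1.2]))  # → 40.0
```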