INDEX
Explanations
words related to racism and discrimination
Neuron Alignment
Index   Value   % of L₁
555     +0.15   0.5%
1480    +0.12   0.4%
468     +0.11   0.4%
Correlated Neurons
Index   P. Corr.   Cos Sim.
555     +0.15      0.02
1480    +0.12      0.02
1425    +0.11      0.02
Negative Logits
Token         Logit
Nuorodos      -0.60
**********/   -0.53
Kön           -0.52
uttosto       -0.51
svolge        -0.50
Debido        -0.50
...');        -0.50
Economía      -0.49
meras         -0.48
citroen       -0.48
Positive Logits
Token     Logit
racism    1.03
racist    0.96
Racism    0.93
Racism    0.91
racist    0.77
racism    0.76
racial    0.76
Rac       0.76
shewn     0.73
Racial    0.73
Activations Density 0.059%