INDEX
Explanations
references to race
references to race and related concepts
Negative Logits
cit        -0.80
UNE        -0.79
erva       -0.75
irs        -0.74
anmar      -0.73
orage      -0.72
hiba       -0.71
unction    -0.71
ickson     -0.70
psons      -0.70
Positive Logits
course            0.91
Equality          0.88
blind             0.87
prejudice         0.83
slurs             0.82
supremacy         0.82
relations         0.82
Discrimination    0.81
bending           0.76
purity            0.75
Activations Density 0.029%