INDEX
Explanations
phrases that discuss race and ethnicity in the context of judgment and equality
Negative Logits
arges         -0.15
#aa           -0.14
meanings      -0.14
actionTypes   -0.14
ЮЛ            -0.13
futures       -0.13
UX            -0.13
harms         -0.13
ux            -0.13
دام           -0.13
Positive Logits
race          0.49
race          0.40
gender        0.39
religion      0.38
age           0.37
Race          0.36
ethnicity     0.36
nationality   0.35
Race          0.34
sex           0.33
Activations Density 0.250%