INDEX
Explanations
characteristics related to human identity or personal attributes like race, ethnicity, religion, sexual orientation, nationality, and physical features
terms related to identity and discrimination based on various characteristics
Negative Logits
EMS         -0.76
imar        -0.76
Oper        -0.75
ERG         -0.69
INTER       -0.67
vae         -0.66
aug         -0.66
ald         -0.66
Dispatch    -0.65
Rak         -0.65
Positive Logits
ancestry          0.88
affiliation       0.86
backgrounds       0.85
ethnicity         0.81
coloring          0.77
discrimination    0.77
prejudice         0.76
affili            0.76
preference        0.76
stripe            0.74
Activations Density 0.096%