INDEX
Explanations
references to diversity and inclusivity across various backgrounds and identities
Negative Logits
Token      Logit
icator     -0.17
emode      -0.15
isque      -0.15
eral       -0.15
üf         -0.14
.bc        -0.14
igram      -0.14
erald      -0.13
iosis      -0.13
aversal    -0.13
Positive Logits
Token      Logit
race       0.64
races      0.63
Races      0.55
Race       0.54
Race       0.53
race       0.51
_race      0.46
ethnicity  0.46
ethnic     0.43
racial     0.42
Activations Density 0.221%