Explanations
mentions of people of color
references to marginalized groups, specifically people of color
Negative Logits
Token     Logit
Xi        -0.78
Niet      -0.72
ãĤ´       -0.72
WAR       -0.71
Nex       -0.69
ertodd    -0.68
ERG       -0.68
Sut       -0.68
sg        -0.67
chn       -0.67
Positive Logits
Token         Logit
blind         0.85
anguage       0.83
minorities    0.73
coded         0.69
backgrounds   0.69
queer         0.68
color         0.67
slurs         0.67
stripes       0.67
="#           0.65
Activations Density: 0.012%