INDEX
Explanations
phrases referring to demographic attributes such as race, ethnicity, and gender
phrases emphasizing racial and gender identities
Negative Logits
Downloadha         -0.80
externalToEVAOnly  -0.72
iquid              -0.71
needles            -0.67
VIDEOS             -0.67
hyde               -0.66
livest             -0.66
netflix            -0.65
plementation       -0.65
downs              -0.64
Positive Logits
course             0.98
whom               0.96
Colour             0.95
stature            0.93
colour             0.90
varying            0.89
course             0.87
color              0.83
renown             0.82
sted               0.81
Activations Density 0.096%