INDEX
Explanations
references to race and issues affecting people of color
Negative Logits
abis      -0.15
hone      -0.15
ât        -0.14
ÅĻej      -0.14
elik      -0.14
ylland    -0.14
fty       -0.14
iance     -0.14
heid      -0.14
ntag      -0.13
Positive Logits
color      0.27
colour     0.25
means      0.23
Means      0.23
goodwill   0.22
Means      0.21
whom       0.20
integrity  0.20
means      0.19
substance  0.18
Activations Density: 0.030%