Explanations
references to discrimination and marginalization of specific groups
Negative Logits
aget      -0.17
ë³µ       -0.15
adro      -0.15
meden     -0.15
venile    -0.15
ÙĪÙĦÙĬ    -0.14
ÙĩاÛĮ     -0.14
ä¿        -0.14
?type     -0.14
ADED      -0.14
Positive Logits
certain       0.32
minorities    0.25
Certain       0.24
vulnerable    0.24
Minor         0.23
Certain       0.23
people        0.23
groups        0.22
others        0.22
Minor         0.21
Activations Density: 0.144%