INDEX
Explanations
words indicating nationalities or ethnic identities
Negative Logits
ord         -0.15
the         -0.14
aine        -0.14
tam         -0.14
ourcem      -0.14
less        -0.13
lessness    -0.13
in          -0.13
_           -0.13
ifs         -0.13
Positive Logits
-American   0.17
-Russian    0.16
ization     0.16
kest        0.15
-flag       0.15
ize         0.14
iqueta      0.14
issan       0.14
throp       0.14
izes        0.14
Activations Density: 0.145%