INDEX
Explanations
references to various nationalities or ethnic groups
Negative Logits
eson       -0.18
lech       -0.17
bidden     -0.16
sar        -0.15
aldi       -0.15
srv        -0.15
ductory    -0.14
less       -0.14
enheim     -0.14
lessly     -0.14
Positive Logits
-American     0.29
-Americans    0.21
-Russian      0.21
-flag         0.20
-born         0.20
ization       0.18
ischer        0.18
ness          0.17
ized          0.17
-made         0.16
Activations Density 0.258%