Explanations
mentions of nationalities or ethnic identities
Negative Logits
lech       -0.17
ary        -0.16
ley        -0.15
omorphic   -0.15
ductory    -0.15
Fle        -0.15
bidden     -0.14
enheim     -0.14
eson       -0.14
oleč       -0.14
Positive Logits
-American    0.30
-Americans   0.24
-Russian     0.23
-flag        0.22
-born        0.21
ischer       0.20
-made        0.17
-speaking    0.17
-European    0.17
ische        0.16
Activations Density: 0.240%