INDEX
Explanations
mentions of nationalities or ethnic groups, particularly focusing on Chinese and Japanese references
Negative Logits
umer        -0.15
anan        -0.14
ular        -0.14
United      -0.14
ĥĿ          -0.14
gings       -0.14
407         -0.14
954         -0.14
ULAR        -0.14
ffffffff    -0.13
Positive Logits
-American   0.36
-Russian    0.32
-speaking   0.28
-Americans  0.27
-language   0.24
ischer      0.23
-born       0.22
-Israel     0.21
-European   0.20
istan       0.20
Activations Density: 0.205%