Explanations
References to nationalities and countries, with a focus on Chinese and Swedish entities.
Negative Logits
Token     Logit
gings     -0.15
bidden    -0.15
alling    -0.14
UBLISH    -0.14
reput     -0.14
aller     -0.14
lect      -0.14
ä¸Ńåľĭ    -0.14   (likely the byte-level form of 中國)
enton     -0.14
atr       -0.13
Positive Logits
Token        Logit
-American    0.43
-speaking    0.34
-Americans   0.33
-Russian     0.33
-born        0.30
-language    0.30
-Israel      0.28
-flag        0.25
ischer       0.25
-made        0.24
Activations Density 0.233%
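
The negative-logit token `ä¸Ńåľĭ` looks like a byte-level BPE rendering rather than real text. The sketch below shows one way to turn such a string back into readable characters. It assumes the underlying tokenizer uses the GPT-2-style byte-to-character table, which the page does not state. The function names `byte_decoder` and `decode_token` are hypothetical helpers written for illustration, not part of any dashboard API.

```python
def byte_decoder() -> dict[str, int]:
    """Build the inverse of the GPT-2-style byte-to-printable-character table.

    Assumption: the tokenizer behind this page uses that table.
    """
    # Bytes that are already printable map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte is mapped to a code point starting at 256.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {chr(c): b for b, c in zip(byte_values, code_points)}


def decode_token(token: str) -> str:
    """Map each displayed character back to its byte, then decode as UTF-8."""
    table = byte_decoder()
    raw = bytes(table[ch] for ch in token)
    # Tokens can be partial UTF-8 sequences, so replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_token("\u00e4\u00b8\u0143\u00e5\u013e\u012d"))  # 中國
    print(decode_token("gings"))  # gings
```

Under that assumption the token decodes to 中國 ("China"), while plain ASCII tokens such as `gings` pass through unchanged.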