CSE 661 Advanced Computer Architecture

PART 1 [84 pts.; 6 each]: Please do only 14 of the following 17 questions from the textbook:

{`1.	1.18 a, b 	10. 5.2 a
2.	2.9 a, b 	11. 5.3
3.	3.1 	        12. 5.4 a
4.	3.2 	        13. 5.9 a, b, c
5.	3.11 a, b, c 	14. 5.20 a
6.	3.18 	        15. 6.2 a, b
7.	4.14 	        16. 6.17 a, b
8.	4.16 	        17. 6.38
9.	5.1 all parts
`}

PART 2 [16 pts.]: Case Studies

  1. [2] How would you rewrite the following sequential code so that it can be run as two parallel threads on a dual-core processor? Try to balance the loads as much as possible between the two threads:
{`
int A[80], B[80], C[80], D[80];
for (int i = 0; i < 40; i++)   /* i = 0..39, so every index stays within the 80-element arrays */
{
    A[i] = B[i] * D[2*i];
    C[i] = C[i] + B[2*i];
    D[i] = 2 * B[2*i];
    A[i+40] = C[2*i] + B[i];
}
`}
  2. [4] [Max one page] Please do A.23 from the textbook.
  3. [5] [Max one page] The Top 500 list, published on top500.org, ranks the fastest scientific machines in the world according to their performance. Visit this website and summarize, in table format, the important characteristics of four machines from the top ten.
  4. [5] Consider two multiprocessor machines with the following configurations:
    • Machine 1, a NUMA machine with two processors, each with 512 MB of local memory, a local memory access latency of 20 cycles per word, and a remote memory access latency of 60 cycles per word.
    • Machine 2, a UMA machine with two processors and 1 GB of shared memory with an access latency of 40 cycles per word.

Suppose an application has two threads, one running on each processor, and each thread needs to access an entire array of 4096 words. Is it possible to partition this array across the local memories of the NUMA machine so that the application runs faster on it than on the UMA machine? If so, specify the partitioning. If not, by how many cycles would the UMA memory latency have to increase for some partitioning on the NUMA machine to run faster than the UMA machine? Assume that memory operations dominate the execution time.