School of Computer Science Georgia Institute of Technology

CS4803DGC, Spring 2011 Prof. Hyesoon Kim

Sample Quiz

| Name :                 |
|------------------------|
| GTID :                 |
| Problem 1 (10 points): |
| Problem 2 (10 points): |
| Problem 3 (20 points): |
| Problem 4 (10 points): |
| Problem 5 (10 points): |
| Problem 6 (10 points): |
|                        |

Total (70 points):\_\_\_\_\_

Note: Please be sure that your answers to all questions (and all supporting work that is required) are contained in the space provided.

Note: Please be sure your name is recorded on each sheet of the exam.

## GOOD LUCK!


Problem 1 (10 points):

How many cycles would it take to execute each of the following code segments on the pipeline design shown below? Assume that a register write and a register read can be performed in the same cycle.



Figure 1: case c

 ADD R0, R1, R2 XOR R2, R1, R0
 ADD R0, R1, R2 AND R0, R3, R4
 AND R2, R1, x0 ADD R1, R6, R1
 ADD R7, R1, R2 BRz X // This branch is taken.
 X XOR R2, R3, R0

Problem 2 (10 points):

Part a. (5 pts) List at least two hardware structures that must be replicated in a datapath to support SMT architectures.

Part b. (5 pts) Discuss at least two major differences between designing game console architectures and desktop processors.

Problem 3 (20 points):

Part a. (5 pts) The Xbox 360 employs several write merge buffers (store gathering buffers). Discuss the benefits of these buffers.

Part b. (5 pts) If the cache block size is 4B instead of 128B, is the write merge buffer still useful? Explain the reason.

Part c. (5 pts) What is the cache-set-locking mechanism, and what is the benefit of using it on the Xbox 360?

Part d. (5 pts) Discuss the negative effects of inaccurate prefetch requests.

Problem 4 (10 points):

Part a. (5 pts) A GPU has 8 SMs, and each SM has 512 floating-point units. The latency of an ADD or MUL operation is 1 cycle, and the latency of a DIV operation is 4 cycles. Each SM runs at 1 GHz. What is the peak FLOP/s?

Part b. (5 pts) Discuss differences between superscalar processors and SIMD processors.

Problem 5 (10 points): Describe how you would implement the following code in CUDA.

```
for (ii = 1; ii < 200000; ii=ii+2) {
    sum += X[ii-1] + X[ii];
}
```
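As a point of reference, the loop can be read as a plain sum over X[0] through X[199999]: ii takes the odd values 1, 3, ..., 199999, so each of the 100000 iterations consumes one adjacent pair of elements. The C sketch below only illustrates that equivalence (exact for integer-valued data, otherwise equal up to floating-point rounding); it is not a CUDA solution. The element type `double`, the length parameter `n`, and the helper names `pairwise_sum` and `plain_sum` are assumptions made for this example and do not come from the problem statement.

```c
#include <stddef.h>

/* Serial reference: the loop from the problem, with the bound passed as n.
 * Assumes n is even, as it is for n = 200000. */
double pairwise_sum(const double *X, size_t n) {
    double sum = 0.0;
    for (size_t ii = 1; ii < n; ii = ii + 2) {
        sum += X[ii - 1] + X[ii];   /* one adjacent pair per iteration */
    }
    return sum;
}

/* Element-by-element sum over the same range, for comparison. */
double plain_sum(const double *X, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += X[i];
    }
    return sum;
}
```

Calling both helpers on the same array with n = 200000 should give matching results, which suggests the loop can be treated as a sum reduction over the whole array.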


Problem 6 (10 points): A new processor has 5-wide SIMD units. Its ISA provides the SIMD instructions SIMADD, SIMLDB, and SIMLDW, along with the scalar instructions ADD, LDB (Load Byte), LDW (Load Word), and BR. The source code in (a) is translated into the RISC code in (b). Convert the code into a SIMD style using the above instructions.

```
for (ii = 1; ii < 200000; ii=ii+2) {
    sum += X[ii-1] + X[ii];
}
```

(a) original source code (X is double word type)

```
     MOV R0, #1
LOOP ADD R1, R3, R0
     ADD R2, R1, -1
     LDW R5 MEM[R1]
     LDW R6 MEM[R2]
     ADD R0, R0, #2
     BR.LESS R0, #200000, LOOP
```

(b) RISC code

Notation: ADD R1, R3, R0 means R1 = R3 + R0. BR.LESS R0, #200000, LOOP means jump to LOOP if R0 is less than 200000. MOV R0, #1 means R0 = 1. LDW R4 MEM[R1] means R4 = MEM[R1].