[0 point](Cuda) In the G80 architecture, each SM has an 8K-entry (8,192-register) register file. If each thread consumes 24 registers, how many threads can run on one SM? Assume each thread also consumes 64 B of shared memory and that there is a total of 64 KB of shared memory on one SM. Threads always execute in groups of 32 (a warp), so the answer must be a multiple of 32.
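A short sketch of the occupancy arithmetic, using only the numbers stated in the problem (the round-down-to-a-warp step is the stated multiple-of-32 constraint):

```python
# Per-SM resource limits, using the problem's numbers.
REGISTERS_PER_SM = 8 * 1024        # 8K-entry register file
REGS_PER_THREAD = 24
SMEM_PER_SM = 64 * 1024            # 64 KB shared memory
SMEM_PER_THREAD = 64               # 64 B per thread
WARP_SIZE = 32

reg_limit = REGISTERS_PER_SM // REGS_PER_THREAD   # 341 threads
smem_limit = SMEM_PER_SM // SMEM_PER_THREAD       # 1024 threads

# The register file is the binding constraint; round down to a warp multiple.
n_threads = min(reg_limit, smem_limit) // WARP_SIZE * WARP_SIZE
print(n_threads)   # 320 threads, i.e. 10 warps
```

So N = 320 threads (10 warps) per SM, limited by registers rather than shared memory.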
(Continued from the previous question.) Assume the memory latency is always 100 cycles and the execution latency of LD and FADD is 4 cycles each. The processor is pipelined, and Fetch, Decode, and Schedule each take 1 cycle. There are 512 threads in total. How many cycles does it take to finish all 512 threads? The machine can fetch/decode/schedule 32 threads per cycle (this group of 32 is called a warp). Let N be the number of threads that can execute together (N is determined by the previous question). The processor waits until all N threads retire, then fetches the remaining threads and executes the code again. The G80 architecture is an SMT machine.
LD R1, mem[R5]
FADD R2, R1, R3
FADD R3, R4, R5
Hint: if N were 3 warps (96 threads), the processor would execute the code as follows:
cycle 1: fetch warp #1 (LD)
cycle 2: fetch warp #2 (LD) decode warp #1
cycle 3: fetch warp #3 (LD) decode warp #2 schedule warp #1
cycle 4: fetch warp #1 (FADD) decode warp #3 schedule warp #2 LD warp #1
cycle 5: fetch warp #2 (FADD) decode warp #1 (FADD) schedule warp #3 LD warp #2
cycle 6: fetch warp #3 (FADD) decode warp #2 (FADD) LD warp #3
....
cycle 104: LD warp #1 done
cycle 105: LD warp #2 done ...
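Following the hint's timeline, the part it pins down can be written as a small sketch: warp i (numbered from 1) issues its LD in cycle 3 + i after the three front-end stages, and the load completes 100 cycles later. The FADD tail and the exact retire cycle depend on how the schedule stage handles the stalled dependent FADD, so this sketch stops at what the hint states; it also shows that with N = 320 threads the 512 threads need two batches.

```python
import math

MEM_LATENCY = 100
FRONTEND = 3   # fetch, decode, schedule: 1 cycle each

def ld_issue_cycle(warp):
    # Warp #1 issues its LD in cycle 4, warp #2 in cycle 5, and so on.
    return FRONTEND + warp

def ld_done_cycle(warp):
    return ld_issue_cycle(warp) + MEM_LATENCY

# Reproduce the hint's numbers:
print(ld_done_cycle(1))   # 104 -> "cycle 104: LD warp #1 done"
print(ld_done_cycle(2))   # 105 -> "cycle 105: LD warp #2 done"

# With N = 320 threads (10 warps) and 512 threads total, the SM must
# run the code in two batches: 10 warps, then the remaining 6 warps.
total_warps = 512 // 32
batches = math.ceil(total_warps / 10)
print(total_warps, batches)   # 16 warps, 2 batches
```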
How does the problem change if we consider memory-bandwidth effects?
CUDA programming has a thread-block concept. How does the problem change if we take blocks into account?
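The key difference is that resources are granted in whole blocks: shared memory is allocated per block, and an SM can only host an integer number of blocks. A sketch of the per-block occupancy arithmetic, using a hypothetical block size of 128 threads (the block size is not given in the problem; the register and shared-memory figures are the problem's):

```python
# Hypothetical block configuration for illustration.
THREADS_PER_BLOCK = 128
REGS_PER_THREAD = 24
SMEM_PER_BLOCK = 64 * THREADS_PER_BLOCK   # 64 B/thread, allocated per block

REGISTERS_PER_SM = 8 * 1024
SMEM_PER_SM = 64 * 1024

# Whole blocks only: round each limit down to an integer block count.
blocks_by_regs = REGISTERS_PER_SM // (THREADS_PER_BLOCK * REGS_PER_THREAD)  # 2
blocks_by_smem = SMEM_PER_SM // SMEM_PER_BLOCK                              # 8
blocks_per_sm = min(blocks_by_regs, blocks_by_smem)

print(blocks_per_sm * THREADS_PER_BLOCK)   # 256 threads
```

With this block size only 256 threads fit, fewer than the 320 of the per-thread calculation, because the leftover registers cannot hold a third whole block.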