for (int ii = 0; ii < 1000; ii++) {
    a[ii] = b[ii] + c[ii];
}

The modified code is:
for (int ii = 0; ii < 1000; ii++) {
    prefetch(a[ii+k]);
    prefetch(b[ii+k]);
    prefetch(c[ii+k]);
    a[ii] = b[ii] + c[ii];
}

prefetch(x) prefetches memory address x.
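As a concrete (non-original) sketch, on GCC or Clang this idea can be written with the __builtin_prefetch intrinsic, which takes the address to prefetch; the distance K and the function name here are assumptions for illustration:

#include <stddef.h>

#define N 1000
#define K 16   /* assumed prefetch distance, in elements */

void add_arrays(double *a, double *b, double *c) {
    for (size_t ii = 0; ii < N; ii++) {
        /* Issue hints for data K iterations ahead; guard the index so we
           never form an out-of-range address. */
        if (ii + K < N) {
            __builtin_prefetch(&a[ii + K]);
            __builtin_prefetch(&b[ii + K]);
            __builtin_prefetch(&c[ii + K]);
        }
        a[ii] = b[ii] + c[ii];
    }
}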
LOAD    LOAD
    \   /
     FADD
      |
      ST

The first load and the second load can be overlapped, so the latency would be 4 cycles (if we just assume a single-issue processor): the two loads issue back-to-back in cycles 1 and 2 with their latencies overlapping, the FADD issues in cycle 3, and the store in cycle 4.
int ii = 0;
for (ii = 0; ii < 1000-k; ii++) {
    prefetch(a[ii+k]);
    prefetch(b[ii+k]);
    prefetch(c[ii+k]);
    a[ii] = b[ii] + c[ii];
}
for (; ii < 1000; ii++) {
    a[ii] = b[ii] + c[ii];
}
int ii;
/* Prologue: prefetch the data for the first k iterations. */
for (ii = 0; ii < k; ii++) {
    prefetch(a[ii]);
    prefetch(b[ii]);
    prefetch(c[ii]);
}
/* Steady state: restart at ii = 0 so the first k elements are also computed. */
for (ii = 0; ii < 1000-k; ii++) {
    prefetch(a[ii+k]);
    prefetch(b[ii+k]);
    prefetch(c[ii+k]);
    a[ii] = b[ii] + c[ii];
}
/* Epilogue: last k iterations; their data has already been prefetched. */
for (; ii < 1000; ii++) {
    a[ii] = b[ii] + c[ii];
}

If the cache block size is 16B instead of 8B, we could unroll the loop to reduce redundant prefetch requests.
int ii;
/* Prologue: one prefetch per 16B cache block (two 8B elements). */
for (ii = 0; ii < k; ii += 2) {
    prefetch(a[ii]);
    prefetch(b[ii]);
    prefetch(c[ii]);
}
/* Steady state, unrolled by 2: one prefetch covers both elements of a block. */
for (ii = 0; ii < 1000-k; ii += 2) {
    prefetch(a[ii+k]);
    prefetch(b[ii+k]);
    prefetch(c[ii+k]);
    a[ii]   = b[ii]   + c[ii];
    a[ii+1] = b[ii+1] + c[ii+1];
}
/* Epilogue: last k iterations; their data has already been prefetched. */
for (; ii < 1000; ii++) {
    a[ii] = b[ii] + c[ii];
}
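As a rough guide (not from the original notes), the prefetch distance k is chosen so that a prefetched block arrives before it is needed: with a memory latency of L cycles and a loop body of C cycles per iteration, k should be at least ceil(L / C). A minimal sketch with purely hypothetical latency numbers:

/* Returns how many iterations ahead to prefetch, given assumed latencies. */
static int prefetch_distance(int mem_latency_cycles, int cycles_per_iter) {
    /* Prefetch far enough ahead that the line arrives before it is used. */
    return (mem_latency_cycles + cycles_per_iter - 1) / cycles_per_iter;  /* ceil */
}
/* e.g. prefetch_distance(200, 10) == 20 iterations ahead */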
LOOP:  LD    F0, 0(R1)
       Add   F4, F0, F2
       Mul   F5, F4, F3
       Store F5, 0(R1)
       Add   R1, R1, #-8
       Br    R1, R2, LOOP
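For reference, this assembly corresponds roughly to the following C loop (the register roles are inferred, not stated in the notes: F2 and F3 hold loop-invariant constants and R1 walks the array from high to low addresses):

/* Sketch of the source loop: load x[i], add a constant, multiply by
   another constant, store the result back, walking downward. */
for (int i = n - 1; i >= 0; i--) {
    x[i] = (x[i] + c_add) * c_mul;
}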
After software pipelining, the steady-state loop becomes:

LOOP:  Store F5, 0(R1)
       Mul   F5, F4, F3
       Add   F4, F0, F2
       LD    F0, -24(R1)
       Add   R1, R1, #-8
       Br    R1, R2, LOOP

With all start-up/clean-up code:
       LD    F0, 0(R1)
       Add   F4, F0, F2
       Mul   F5, F4, F3
       LD    F0, -8(R1)
       Add   F4, F0, F2
       LD    F0, -16(R1)
LOOP:  Store F5, 0(R1)
       Mul   F5, F4, F3
       Add   F4, F0, F2
       LD    F0, -24(R1)
       Add   R1, R1, #-8
       Br    R1, R2, LOOP
       Store F5, -8(R1)
       Mul   F5, F4, F3
       Store F5, -16(R1)
       Add   F4, F0, F2
       Mul   F5, F4, F3
       Store F5, -24(R1)
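The same transformation can be sketched at the C level (the function and variable names are illustrative, assuming the register roles inferred above): each kernel pass stores one iteration's result, multiplies for the next, adds for the one after that, and loads for the one after that, mirroring the rotated assembly loop.

/* Computes x[i] = (x[i] + c_add) * c_mul for i = n-1 down to 0,
   software-pipelined so three iterations are in flight at once. */
void scale_bias(double *x, int n, double c_add, double c_mul) {
    if (n < 3) {                      /* fall back for tiny trip counts */
        for (int i = n - 1; i >= 0; i--)
            x[i] = (x[i] + c_add) * c_mul;
        return;
    }

    /* Prologue: fill the pipeline (LD, Add, Mul, LD, Add, LD). */
    double load = x[n - 1];
    double sum  = load + c_add;       /* add for iteration n-1 */
    double prod = sum * c_mul;        /* mul for iteration n-1 */
    load = x[n - 2];
    sum  = load + c_add;              /* add for iteration n-2 */
    load = x[n - 3];                  /* load for iteration n-3 */

    /* Kernel: Store, Mul, Add, LD of four consecutive iterations. */
    int i;
    for (i = n - 3; i >= 1; i--) {
        x[i + 2] = prod;              /* store iteration i+2 */
        prod = sum * c_mul;           /* mul for iteration i+1 */
        sum  = load + c_add;          /* add for iteration i */
        load = x[i - 1];              /* load for iteration i-1 */
    }

    /* Epilogue: drain the pipeline (Store, Mul, Store, Add, Mul, Store). */
    x[2] = prod;
    prod = sum * c_mul;
    x[1] = prod;
    sum  = load + c_add;
    prod = sum * c_mul;
    x[0] = prod;
}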
ROB-based mechanisms need to wait for the branch instruction to reach the head of the ROB to recover from a misprediction. Once this happens, all instructions after the branch can be flushed and the RAT can be reset to point to the ARF. With checkpoint-based recovery, each branch prediction creates a new copy of the RAT. On a misprediction, the wrong-path instructions are flushed and the RAT checkpoint taken just before the mispredicted branch is restored. With checkpointing, recovery can therefore begin as soon as the misprediction is detected, whereas with the ROB we must wait for the branch to retire before execution resumes.

For example, consider the following sequence of instructions:

I1: Add  R1, R4, R5
I2: Mul  R2, R3, R6
I3: BEQZ R2, End
I4: Sub  R1, R4, R5
I5: Div  R2, R3, R6
I6: Halt
End:
I7: Add  R4, R2, R5
I8: Sub  R3, R1, R6
I9: Halt

If BEQZ is mispredicted, then with ROB-based recovery I7 and I8 must wait for I1 and I2 to retire before execution resumes with a valid RAT. If I1 and I2 together take 3 cycles, the misprediction results in a stall of 3 cycles. With checkpointing, as soon as the misprediction is detected the RAT is restored to its state before the branch, and I7 and I8 do not need to wait for the branch to retire. Hence, checkpointing is faster.
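A minimal sketch of the checkpointing idea in C (structure, sizes, and names are illustrative, not taken from the notes): every renamed branch snapshots the RAT, and a misprediction restores that snapshot immediately instead of waiting for the branch to reach the head of the ROB.

#define NUM_ARCH_REGS   16
#define MAX_CHECKPOINTS  8

/* Each architectural register maps to a physical register tag. */
typedef struct {
    int phys[NUM_ARCH_REGS];
} RAT;

static RAT rat;                          /* current speculative RAT            */
static RAT checkpoints[MAX_CHECKPOINTS]; /* one saved RAT per in-flight branch */
static int num_checkpoints = 0;

/* Called when a branch is renamed: snapshot the current RAT and remember the
   checkpoint id with the branch. (A real design stalls renaming when it runs
   out of checkpoints; that check is omitted here.) */
int take_checkpoint(void) {
    checkpoints[num_checkpoints] = rat;  /* struct copy */
    return num_checkpoints++;
}

/* Called the moment the branch is found to be mispredicted: restore the RAT
   at once and discard the younger checkpoints, without waiting for the
   branch to retire from the ROB. */
void recover(int checkpoint_id) {
    rat = checkpoints[checkpoint_id];
    num_checkpoints = checkpoint_id;
}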
VLIW                                        SuperScalar
----                                        -----------
More than one instruction,                  More than one instruction,
statically combined                         dynamically combined
Static issue                                Dynamic issue
Compiler optimized                          Runtime optimized in hardware
Can't react to latencies                    Can react to latencies in caches, memory, etc.
Larger view of program may                  Smaller window for optimization
lead to better optimization
Less complex hardware,                      Complex hardware,
less power consumed                         more power consumed
And more..
I1: 0xa000 ADD   R0, R0, R2
I2: 0xa004 ADD   R1, R1, R0
I3: 0xa008 FADD  F3, F2, F3
I4: 0xa00B ADD   R1, R1, R0
I5: 0xa010 FADD  F3, F2, F3
I6: 0xa014 LD    R2, MEM[R6]
I7: 0xa018 ADD   R1, R1, R0
I8: 0xa01B FMUL  F1, F5, F4
I9: 0xa020 LD    F2, MEM[R0]
IA: 0xa024 FADD  F4, F1, F2
IB: 0xa028 LD    F3, MEM[R0]
IC: 0xa030 STORE MEM[R5], F4
ID: 0xa034 FADD  F4, F3, F4
IE: 0xa038 ADD   R2, R2, R0
IF: 0xa03B BR    R2, 0x8000

ADD:   simple ALU instruction (1-cycle latency)
FADD:  floating point instruction (1-cycle latency)
FMUL:  floating point instruction (2-cycle latency)
LD:    load instruction (2-cycle latency)
STORE: store instruction (1-cycle latency)
BR:    branch instruction (1-cycle latency)

Rename the following registers (to break the false dependences on F2 and F3):
I9: F2 to F6
IA: F2 to F6
IB: F3 to F7
ID: F3 to F7

Now the VLIW schedule looks like:

----------------------
|  I1  |  I6  |  I8  |
----------------------
|  I9  |  IB  |  I2  |
----------------------
|  I4  |  I3  |  IE  |
----------------------
|  I7  |  IA  |  I5  |
----------------------
|  IC  |  ID  |  IF  |
----------------------
0xa000 ADD   R0, R0, R2
0xa004 ADD   R1, R1, R0
0xa008 FADD  R2, R2, F3
0xa00B FMUL  R1, R1, R4
0xa010 STORE MEM[R2], R3
0xa014 FADD  R3, F3, R4
0xa018 BR    R3, 0x8000

ADD: simple ALU instruction
ADD  R0, R0, R2     : NOP
ADD  R1, R1, R0     : NOP
FADD R2, R2, F3     : NOP
FMUL R1, R1, R4     : STORE MEM[R2], R3
FADD R3, F3, R4     : NOP
BR   R3, 0x8000     : NOP
0xa000 ADD   R0, R0, R5
0xa004 ADD   R1, R1, R0
0xa008 FADD  R2, R2, F3
0xa00B FMUL  R1, R1, R4
0xa010 STORE MEM[R2], R3
0xa014 FADD  R3, F3, R4
0xa018 BR    R3, 0x8000

ADD: simple ALU instruction
ADD   R0, R0, R5    : FADD R2, R2, F3
ADD   R1, R1, R0    : NOP
STORE MEM[R2], R3   : FMUL R1, R1, R4
FADD  R3, F3, R4    : NOP
BR    R3, 0x8000    : NOP