School of Computer Science
Georgia Institute of Technology
CS4290/CS6290 HPCA Fall 2010
Programming assignment #2
Due (Part 1, program): Wednesday, November 3, 6:00 pm; report: Thursday, Nov 4 before class (originally Friday, October 29, 6:00 pm; report: Tuesday, Nov 2 before class)
Hyesoon Kim, Instructor
This is an individual assignment. You can discuss this assignment with other classmates, but you must do your assignment individually. Please follow the submission instructions; if you do not follow the submission file names, you will not receive full credit. Please check the class homepage for the latest updates. Your code must run on killerbee[1-5].cc.gatech.edu with g++ 4.1 or on the warp clusters.
Overview
This assignment is composed of two parts: Part 1: building the simulator, Part 2: simulation and analysis
Part 1 (100 pts): Building a superscalar processor
In this assignment, you will improve your pipeline design to add (1) a branch predictor, (2) superscalar execution, and (3) Tomasulo's algorithm (register renaming and out-of-order
scheduling). You will extend your Lab #1 pipeline design. An add_me.txt file is provided; please add its code at the appropriate places. The userknob.h and simknob.h files have been updated, so please download the new files.
(1) Implement a g-share branch predictor
Add a g-share branch predictor in the FE stage. When a processor
fetches a conditional branch instruction (cf_type must be CF_CBR), it
accesses the branch predictor. We assume that all other branches are
correctly predicted, so there are no pipeline bubbles for them.
If a branch is mispredicted, the processor should not fetch
instructions from the correct path (i.e., it should not
call the get_op function). Instead, it should stall the pipeline until
the branch instruction is resolved. After the branch is resolved, in
the following cycle the correct PC address is forwarded to the PC
latch. From the next cycle the processor can start to fetch
instructions from the correct path. Note that in real hardware, the
processor will keep fetching wrong-path instructions until a branch is
resolved. Once a misprediction is discovered, the processor flushes
the pipeline and starts to fetch instructions from the correct
path. Since we are building a trace-driven simulator, we cannot send
wrong-path instructions into the pipeline. That's why we are stalling
the pipeline to model the time delay between a branch prediction and
branch resolution. To simplify the problem, after a branch is
predicted, you also update the branch predictor including BHR with the
correct value at the same cycle. In a real hardware, the branch
predictor is speculatively updated with predicted values. When a
branch misprediction is detected, the BHR value is recovered. However,
since we do not fetch wrong-path instructions, this step is
unnecessary. The 2bit counters in the PHT table are initialized with
"10" weekly taken.
Note that, in real hardware, when a branch misprediction is
detected, the processor cannot start renaming again until all the correct-path
instructions are retired. Again, to simplify the simulator, the
processor resumes as soon as the correct-path PC value is
available.
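For reference, here is a minimal, non-authoritative C++ sketch of a gshare predictor. The class and method names are made up for illustration; only the behavior described above (PC xor GHR indexing, 2^KNOB_GHR_LENGTH counters initialized to weakly taken, same-cycle update with the correct outcome) comes from the assignment.

// Sketch of a gshare predictor (illustrative names, not a required interface).
#include <cstdint>
#include <vector>

class GsharePredictor {
public:
  // ghr_length corresponds to KNOB_GHR_LENGTH; PHT has 2^ghr_length counters.
  explicit GsharePredictor(unsigned ghr_length)
      : ghr_length_(ghr_length), ghr_(0),
        pht_(1u << ghr_length, 2) {}          // 2 = "10", weakly taken

  bool predict(uint64_t pc) const {
    return pht_[index(pc)] >= 2;              // taken if the counter is 10 or 11
  }

  // Per the assignment, called in the same cycle with the correct outcome.
  void update(uint64_t pc, bool taken) {
    uint8_t &ctr = pht_[index(pc)];
    if (taken)  { if (ctr < 3) ++ctr; }
    else        { if (ctr > 0) --ctr; }
    ghr_ = ((ghr_ << 1) | (taken ? 1u : 0u)) & mask();   // shift outcome into the BHR
  }

private:
  unsigned mask() const { return (1u << ghr_length_) - 1; }
  unsigned index(uint64_t pc) const {
    return (static_cast<unsigned>(pc) ^ ghr_) & mask();  // gshare: PC xor GHR
  }
  unsigned ghr_length_;
  unsigned ghr_;
  std::vector<uint8_t> pht_;
};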
(Figure: pipeline design)
(2) Implement a superscalar processor
Now extend your simulator to handle more than one instruction per
cycle. It can fetch/decode/issue/execute/retire up to KNOB_ISSUE_WIDTH
instructions per cycle. Fetch, decode, issue, and retire must be in order. We assume that
the pipeline has KNOB_ISSUE_WIDTH functional units. All
functional units are pipelined.
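As a rough illustration of the per-cycle width limit, here is a sketch in which the Op struct and deques are placeholders, not the lab's latch structures:

#include <deque>

// Sketch only: move up to 'width' ops per cycle from one stage to the next,
// preserving program order. 'Op' and the deques stand in for the lab's latches.
struct Op { long long inst_id; };

void advance_up_to_width(std::deque<Op*> &from, std::deque<Op*> &to, int width) {
  for (int i = 0; i < width && !from.empty(); ++i) {
    to.push_back(from.front());   // in-order hand-off, at most 'width' per cycle
    from.pop_front();
  }
}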
(3) Implement an out-of-order scheduling algorithm
To support out-of-order scheduling and in-order commit, you now change
your pipeline stages. Instructions are renamed and can be executed
out of order, but they must be retired in order.
An out-of-order processor requires a scheduler, and in-order
retirement requires a ROB.
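For orientation, here is a minimal sketch of the bookkeeping this implies, assuming the ROB index is also used as the renaming tag (consistent with the note below that the number of physical registers equals KNOB_ROB_SIZE); all field names are illustrative, not required:

#include <cstdint>

// Illustrative per-instruction bookkeeping; not a required layout.
struct ROBEntry {
  bool     valid;       // entry allocated
  bool     done;        // finished execution; may retire once it reaches the head
  int      dest_areg;   // architectural destination register (-1 if none)
  uint64_t pc;
};

struct SchedEntry {
  bool valid;           // scheduler slot occupied
  int  rob_id;          // the ROB index doubles as the "physical register" tag
  int  src_tag[2];      // producer ROB ids, or -1 if the value is already available
  bool src_ready[2];    // operand availability (set by wakeup/broadcast)
};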
In this assignment, you do not need pipeline latches other than the fetch stage's FE_latch.
To reduce unnecessary debugging, I suggest removing all the other pipeline latches;
you can also remove the pipeline-latch printf statements.
Pipeline changes:
Instead of FE/ID/EX/MEM/WB stages, now the processor has
FE/ID/ISSUE/EX/WB stages.
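One common way to organize the per-cycle loop, once the intermediate latches are gone, is to call the stages back to front so each stage consumes last cycle's state. The sketch below assumes hypothetical helper names (sim_done, cycle_count) rather than the provided skeleton's:

// Sketch of the per-cycle loop; stage bodies and helpers are assumed to exist elsewhere.
void WB_stage(); void EX_stage(); void ISSUE_stage(); void ID_stage(); void FE_stage();
bool sim_done();                 // hypothetical end-of-simulation check
extern long long cycle_count;

void run_a_cycle_sketch() {
  while (!sim_done()) {
    WB_stage();      // retire in order, up to KNOB_ISSUE_WIDTH per cycle
    EX_stage();      // execute, resolve branches, complete ops
    ISSUE_stage();   // insert into the scheduler and select ready ops
    ID_stage();      // decode and allocate ROB entries
    FE_stage();      // branch prediction and fetch
    cycle_count++;
  }
}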
The major features of each pipeline stage are as follows:
FE_stage: It accesses the I-cache and reads an op (instruction) from the trace file.
It accesses the branch predictor and predicts the next target address. Only conditional branches use the gshare branch predictor;
all other branches are always correctly predicted. We assume that the processor has a BTB, so it knows whether an instruction is a branch or not
at the FE_stage. You do not need to model the BTB. The fetched instructions are stored in the FE_stage latch.
ID_stage: Instructions are decoded at the beginning of the ID stage.
Instructions are sent to the ROB here. If there is no space in the ROB,
the processor stalls the FE_stage.
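In sketch form, the stall condition might look like the following; rob_has_space, allocate_rob_entry, and the FE-latch queue are hypothetical helpers standing in for your own code:

#include <deque>

struct Op { long long inst_id; };        // placeholder op type
extern std::deque<Op*> fe_latch_ops;     // ops waiting in the FE latch
extern bool fe_stall;                    // FE_stage does not fetch while set
bool rob_has_space();                    // hypothetical: is a ROB entry free?
void allocate_rob_entry(Op *op);         // hypothetical: also records renaming info

// Sketch: decode up to 'width' ops; stall the front end when the ROB is full.
void ID_stage_sketch(int width) {
  for (int i = 0; i < width && !fe_latch_ops.empty(); ++i) {
    if (!rob_has_space()) {
      fe_stall = true;                   // no ROB space: tell FE_stage to stop fetching
      return;                            // (cleared elsewhere once space frees up)
    }
    Op *op = fe_latch_ops.front();
    fe_latch_ops.pop_front();
    allocate_rob_entry(op);
  }
}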
ISSUE_stage: In this stage, the processor inserts instructions
into the scheduler. Instructions access the register file when they
enter the scheduler. Up to KNOB_ISSUE_WIDTH instructions can be sent
to the scheduler. Ready instructions are selected in this stage. The
ready instructions are sent to the EX stage. Only up to
KNOB_ISSUE_WIDTH instructions can be selected. When the scheduler
selects instructions, it also checks the status of functional units.
There are KNOB_ISSUE_WIDTH functional units. If there is no
available functional unit, the instruction cannot be scheduled. (Since
the number of functional units equals KNOB_ISSUE_WIDTH, in
this assignment there will never be a case where the processor cannot
schedule instructions because of a lack of functional units.)
Up to KNOB_ISSUE_WIDTH memory instructions can access the d-cache per cycle.
You must also provide an in-order scheduling mode (a sketch of the select step for both policies follows).
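Here is a sketch of the select step covering both policies controlled by KNOB_OOO_SCHEDULER; the data structures and the 'order' argument are assumptions, not the required design:

#include <vector>

struct SchedEntry {
  bool valid = false;
  bool src_ready[2] = {false, false};
};

// Select up to 'width' ready entries this cycle. 'order' lists scheduler slots
// in program (oldest-first) order. With in-order scheduling (ooo == false),
// selection stops at the oldest entry that is not ready. Since the number of
// functional units equals the issue width, bounding by 'width' also covers FUs.
std::vector<int> select_ready(const std::vector<SchedEntry> &sched,
                              const std::vector<int> &order,
                              int width, bool ooo) {
  std::vector<int> picked;
  for (int idx : order) {
    if ((int)picked.size() >= width) break;
    const SchedEntry &e = sched[idx];
    if (!e.valid) continue;                     // free slot
    if (e.src_ready[0] && e.src_ready[1]) picked.push_back(idx);
    else if (!ooo) break;                       // in-order: younger ops must wait
  }
  return picked;
}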
EX_stage:
All the functional units are pipelined.
The d-cache is now
accessed here. The processor also searches the load/store buffer (you
do not need to implement the load/store buffer in this assignment,
since the d-cache always hits). Branch
instructions are also resolved here. When an instruction finishes
execution, it frees the corresponding scheduler entry at that moment.
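When an op finishes execution, the sequence might look like this sketch; the containers and the broadcast step are hypothetical:

#include <vector>

struct ROBEntrySketch   { bool valid = false; bool done = false; };
struct SchedEntrySketch { bool valid = false; int rob_id = -1; };

extern std::vector<ROBEntrySketch>   rob;
extern std::vector<SchedEntrySketch> sched;
void broadcast_tag(int rob_id);   // hypothetical wakeup of dependent instructions

// Sketch: in the cycle an op finishes execution, free its scheduler entry,
// mark its ROB entry done, and broadcast its tag on a forwarding path.
void on_exec_complete(int sched_idx) {
  int rob_id = sched[sched_idx].rob_id;
  sched[sched_idx].valid = false;   // scheduler entry freed at this moment
  rob[rob_id].done = true;          // may retire once it reaches the ROB head
  broadcast_tag(rob_id);            // dependents become eligible the next cycle
}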
WB_stage: Instructions are retired at this stage. The destination
register becomes available at this moment. The branch target address is sent to the PC register at this stage.
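A sketch of in-order retirement from the ROB head, bounded by the issue width per cycle (names are again illustrative):

#include <vector>

struct ROBEntrySketch { bool valid = false; bool done = false; };

extern std::vector<ROBEntrySketch> rob;   // used as a circular buffer
extern int rob_head;                      // index of the oldest instruction
extern long long inst_retired;

// Sketch: retire at most 'width' instructions per cycle, strictly in program order.
void WB_stage_sketch(int width) {
  for (int i = 0; i < width; ++i) {
    ROBEntrySketch &head = rob[rob_head];
    if (!head.valid || !head.done) break;   // oldest op not finished: stop retiring
    head.valid = false;                     // free the ROB entry
    rob_head = (rob_head + 1) % (int)rob.size();
    ++inst_retired;
  }
}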
Hardware structure size
The scheduler size is determined by KNOB_SCHED_SIZE.
The ROB size is determined by KNOB_ROB_SIZE.
The number of physical registers is the same as KNOB_ROB_SIZE (so your simulator probably does not need a separate physical register file).
We also assume that the load/store buffer size is the same as KNOB_ROB_SIZE.
Data forwarding
An instruction becomes eligible for
execution in the cycle that follows the one in which its last operand
was forwarded from the EX stage. There are KNOB_ISSUE_WIDTH data forwarding paths from the EX
stage.
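One way to model this timing is to timestamp each source operand when its producer forwards the result; a sketch under that assumption:

// Sketch: a producer that finishes EX in cycle C forwards its result, and a
// consumer's source becomes usable from cycle C + 1 onward.
struct SrcState {
  bool      ready = false;
  long long ready_cycle = 0;   // first cycle the operand may be consumed
};

void wakeup_source(SrcState &src, long long exec_done_cycle) {
  src.ready = true;
  src.ready_cycle = exec_done_cycle + 1;   // forwarded value usable the next cycle
}

bool source_usable(const SrcState &src, long long current_cycle) {
  return src.ready && current_cycle >= src.ready_cycle;
}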
KNOBS
New KNOBS
KNOB_ROB_SIZE: it sets the number of entries in the ROB (default value is 64)
KNOB_SCHED_SIZE: it sets the number of entries in the scheduler (default value is 8)
KNOB_OOO_SCHEDULER: it sets the scheduling policy (0: in-order scheduling, 1: out-of-order scheduling)
KNOB_GHR_LENGTH: it sets the length of the GHR. The number of gshare predictor entries is 2^(KNOB_GHR_LENGTH).
KNOB_DEBUG_PRINT: it enables debug messages (0: no debug messages, 1: print debug messages)
KNOB_ISSUE_WIDTH: it sets the fetch/decode/issue/schedule/execute/retire width
KNOBS from Assignment #1
KNOB_OUTPUT_FILE: set the output filename of the print_stats function
KNOB_TRACE_NAME: set the input trace file name
KNOB_READ_TRACE: (1): read trace and execute the simulator (0): execute the binary
You should not set both KNOB_WRITE_TRACE and KNOB_READ_TRACE to 1; exactly one of them should be 1.
KNOB_MAX_SIM_COUNT: set the maximum cycle_count for the simulation
KNOB_MAX_INST_COUNT: set the maximum inst_count for the simulation
KNOB_PRINIT_PIPE_FREQ: set the frequency of calling the print_pipeline() function
Grading
We will check branch predictor accuracies as we vary the length of GHR.
We will check IPC values to grade your homework. We will vary issue width.
We will also use the KNOB_DEBUG_PRINT knob to check in-order/OOO scheduling and in-order retirement. You must replace the print_stats() function with the new one in the add_me.txt file.
Please make sure your simulation runs to the end of the traces, and remove all debug statements that you added.
Submission Guide
Please do not turn in pzip files (trace files). Trace files are huge and will cause a lot of problems.
(Tar the lab2 directory, gzip the tar file, and submit the lab2.tar.gz file at T-square.)
cd pin-2.8-36111-gcc.3.4.6-ia32_intel64-linux/source/tools
cd lab2
make clean
rm *.pzip
cd ..
tar cvf lab2.tar lab2
gzip lab2.tar
Please do not let your zombie jobs use all the machine resources!!
Note: Assignment #2 is significantly longer than Assignment #1. Please start early and read the FAQ before you start.
Part 2 (20 pts): Simulation and analysis
Due: 11/2/10 before class. Hard copy only
Using your simulator, you will do a performance analysis. Generate traces that are at least 10,000 instructions long; any application is fine. Please do not turn in your traces!!
Include your simulation results in the report. The default configuration is:
gshare history length: 8
issue width: 2
out of order scheduler
ROB size: 64
- Vary the issue width from 1 to 3 (1, 2, and 3) and discuss the performance impact. Test both the in-order and out-of-order scheduling mechanisms. Discuss how the performance improvements differ.
(Hint: why is (IPC(width=2) - IPC(width=1)) >> (IPC(width=3) - IPC(width=2))?)
- Write two simple benchmarks that have different branch prediction accuracies; the difference should be more than 5%. You could also use existing benchmarks.
Simple floating-point benchmarks such as matrix multiplication or PDE calculations typically have high branch prediction accuracy, while complicated benchmarks such as bzip2 or the simulator itself tend to have a high branch misprediction rate. (A toy example is sketched below.)
Demonstrate the performance as you vary the gshare history length (4, 8, 12).
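For example, a toy pair of kernels like the sketch below (compiled and traced with the Pin tool) would normally show such a gap: the first kernel's only conditional branch is the loop-back branch, which gshare learns easily, while the second branches on pseudo-random data. This is only an illustration; check the generated code (or compile with low optimization) so the data-dependent branch is not turned into a conditional move.

// Illustrative only: two kernels with very different branch behavior.
#include <cstdlib>

// Highly predictable: the only conditional branch is the loop-back branch.
long long predictable_kernel(int n) {
  long long sum = 0;
  for (int i = 0; i < n; ++i)
    sum += i;
  return sum;
}

// Hard to predict: the branch outcome depends on pseudo-random data (~50% taken).
long long unpredictable_kernel(int n) {
  long long sum = 0;
  for (int i = 0; i < n; ++i) {
    if (std::rand() & 1)
      sum += i;
    else
      sum -= i;
  }
  return sum;
}

int main() {
  long long a = predictable_kernel(20000);     // well over 10,000 instructions
  long long b = unpredictable_kernel(20000);
  return (int)((a + b) & 1);                   // keep the results live
}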