School of Computer Science
Georgia Institute of Technology
CS4803DGC, Spring 2011
Programming Assignment #1
Due: Friday, Feb. 18, 6:00 pm for an extra 10% (extended deadline: Monday, Feb. 21, 6:00 pm)
Hyesoon Kim, Instructor
Introduction
This is an individual assignment. In this assignment, you will design several micro-benchmarks that reveal the performance characteristics of GPUs.
Each kernel must run long enough for its effective execution time to be measured reliably. Submit a report that describes the basic idea of each benchmark and analyzes the results.
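One way to obtain a stable execution time is to launch the kernel several times between two CUDA events and report the average. The sketch below is only an illustration under assumptions: busy_kernel, avg_ms, and all sizes and repetition counts are hypothetical placeholders, not part of the assignment.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: each thread runs a dependent
// multiply-add chain so that one launch takes a measurable amount of time.
__global__ void busy_kernel(float *out, int iters) {
    float x = threadIdx.x * 1e-3f;
    for (int i = 0; i < iters; ++i) {
        x = x * 1.000001f + 0.5f;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;  // keep the result live
}

// Average time per launch in milliseconds, given the total over `reps` launches.
static float avg_ms(float total_ms, int reps) { return total_ms / reps; }

int main() {
    const int blocks = 64, threads = 256, iters = 1 << 16, reps = 20;
    float *d_out = NULL;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    busy_kernel<<<blocks, threads>>>(d_out, iters);  // warm-up launch, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) {
        busy_kernel<<<blocks, threads>>>(d_out, iters);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    printf("average kernel time: %.3f ms over %d launches\n",
           avg_ms(total_ms, reps), reps);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

If the average changes noticeably from run to run, increasing iters or reps is a reasonable first step.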
Peak Performance Benchmark
Write a benchmark that achieves the highest FLOPS you can. Hint: use FMA (fused multiply-add) instructions.
Calculate the FLOPS by counting the floating-point operations manually (i.e., how many FP instructions are in the loop, and how many times is the loop executed?).
Include the results in the report. Vary the number of threads and blocks, and report the numbers you used. Does the peak FLOPS vary as you vary the number of threads or blocks?
Explain the results.
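The manual FLOP count can be sketched as follows. This is a minimal illustration, not a reference solution: the kernel name fma_kernel, the helper gflops, the constants ITERS and CHAINS, and the launch configurations are all hypothetical, and the count assumes the common convention that one FMA equals two floating-point operations.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int ITERS = 1 << 16;  // loop iterations per thread (illustrative)
const int CHAINS = 4;       // independent FMA chains per thread (illustrative)

// Each loop iteration issues CHAINS fused multiply-adds. The coefficients are
// kernel arguments so the compiler cannot fold the loop away at compile time.
__global__ void fma_kernel(float *out, float a, float b) {
    float x0 = threadIdx.x, x1 = x0 + 1.0f, x2 = x0 + 2.0f, x3 = x0 + 3.0f;
    for (int i = 0; i < ITERS; ++i) {
        x0 = fmaf(x0, a, b);
        x1 = fmaf(x1, a, b);
        x2 = fmaf(x2, a, b);
        x3 = fmaf(x3, a, b);
    }
    // The three final additions are ignored in the FLOP count below.
    out[blockIdx.x * blockDim.x + threadIdx.x] = x0 + x1 + x2 + x3;
}

// Manual count: 2 FLOPs per FMA * CHAINS * ITERS per thread, over all threads.
static double gflops(long long total_threads, double ms) {
    double flops = 2.0 * CHAINS * ITERS * (double)total_threads;
    return flops / (ms * 1e6);  // ms * 1e6 converts FLOPs per ms to GFLOPS
}

int main() {
    const int thread_counts[] = {32, 128, 256, 512};
    const int block_counts[] = {1, 16, 64, 256};
    float *d_out = NULL;
    cudaMalloc(&d_out, 256 * 512 * sizeof(float));  // largest configuration

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int bi = 0; bi < 4; ++bi) {
        for (int ti = 0; ti < 4; ++ti) {
            int blocks = block_counts[bi], threads = thread_counts[ti];
            fma_kernel<<<blocks, threads>>>(d_out, 0.999f, 0.001f);  // warm-up
            cudaEventRecord(start);
            fma_kernel<<<blocks, threads>>>(d_out, 0.999f, 0.001f);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("blocks=%3d threads=%3d: %8.2f GFLOPS\n", blocks, threads,
                   gflops((long long)blocks * threads, ms));
        }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

Several independent chains are used because a single dependent chain makes every FMA wait for the previous one; whether four chains are enough to approach the peak depends on the GPU.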
Memory Latency Measurement Benchmark
We start with simple sequential memory accesses.
Vary the starting address of the memory instruction and see whether the performance varies as well.
Coalesced vs. Uncoalesced Benchmarks
Change the memory access pattern so that the generated addresses are uncoalesced.
For example, LD A[tid+3] shifts every access by a fixed offset, while LD A[tid*3] generates a strided access pattern.
Do you see a performance difference between sequential and uncoalesced memory accesses?
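A single kernel with two parameters can cover both the starting-address variation and the uncoalesced patterns. The following is a sketch under assumptions: read_kernel, elems_needed, the array sizes, and the chosen stride and offset values are hypothetical. Offset 3 corresponds to the A[tid+3] example; the strided cases are added only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread reads one element at index tid * stride + offset and writes it
// to out[tid]. The write is always sequential; only the read pattern changes.
__global__ void read_kernel(const float *in, float *out, int stride, int offset) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid * stride + offset];
}

// Number of input elements needed so that the largest index stays in bounds.
static size_t elems_needed(int total_threads, int stride, int offset) {
    return (size_t)(total_threads - 1) * stride + offset + 1;
}

int main() {
    const int blocks = 4096, threads = 256, total = blocks * threads, reps = 20;
    // {stride, offset} pairs; all values are illustrative choices.
    const int cases[][2] = {{1, 0}, {1, 3}, {3, 0}, {16, 0}};
    const int num_cases = sizeof(cases) / sizeof(cases[0]);

    float *d_in = NULL, *d_out = NULL;
    size_t in_elems = elems_needed(total, 16, 3);  // large enough for every case
    cudaMalloc(&d_in, in_elems * sizeof(float));
    cudaMalloc(&d_out, total * sizeof(float));
    cudaMemset(d_in, 0, in_elems * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int c = 0; c < num_cases; ++c) {
        int stride = cases[c][0], offset = cases[c][1];
        read_kernel<<<blocks, threads>>>(d_in, d_out, stride, offset);  // warm-up
        cudaEventRecord(start);
        for (int r = 0; r < reps; ++r) {
            read_kernel<<<blocks, threads>>>(d_in, d_out, stride, offset);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("stride=%2d offset=%d: %.4f ms per launch\n", stride, offset, ms / reps);
    }

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Every case launches the same number of threads, so any difference in time per launch comes from the read pattern rather than from the amount of useful data.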
Arithmetic Intensity
Write a program whose arithmetic intensity you can vary. Plot the performance vs. arithmetic intensity.
Vary the number of threads and blocks (at least 15 cases).
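One possible way to control arithmetic intensity is to fix the memory traffic per thread and change the number of FMAs applied to each loaded value. The sketch below assumes one 4-byte load and one 4-byte store per thread and counts an FMA as two FLOPs; intensity_kernel, intensity, and all sizes are hypothetical names and values chosen for illustration. It keeps the launch configuration fixed, so the thread and block sweep still has to be added.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread loads one float, applies `fmas` dependent FMAs, and stores one float.
__global__ void intensity_kernel(const float *in, float *out, int n, int fmas,
                                 float a, float b) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float x = in[tid];
        for (int i = 0; i < fmas; ++i) {
            x = fmaf(x, a, b);
        }
        out[tid] = x;
    }
}

// FLOPs per byte: 2 FLOPs per FMA over one load plus one store of a float.
static double intensity(int fmas) {
    return (2.0 * fmas) / (2.0 * sizeof(float));
}

int main() {
    const int n = 1 << 22, threads = 256, blocks = (n + threads - 1) / threads;
    const int reps = 10;
    float *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int fmas = 1; fmas <= 1024; fmas *= 2) {
        intensity_kernel<<<blocks, threads>>>(d_in, d_out, n, fmas, 0.999f, 0.001f);
        cudaEventRecord(start);
        for (int r = 0; r < reps; ++r) {
            intensity_kernel<<<blocks, threads>>>(d_in, d_out, n, fmas, 0.999f, 0.001f);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gflops = 2.0 * fmas * n / ((ms / reps) * 1e6);
        printf("%8.2f FLOPs/byte  %10.2f GFLOPS\n", intensity(fmas), gflops);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Each printed line gives one point for the performance vs. arithmetic intensity plot.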
Peak Memory Bandwidth
Write a program that achieves peak memory bandwidth. Using the same kernel, vary the number of threads and blocks. The results should show the bandwidth saturating beyond a certain number of threads and blocks. Compare your results with the bandwidth test program in the CUDA SDK.
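A grid-stride copy kernel is one option for this sweep, because it moves the same amount of data regardless of the launch configuration. This is a sketch under assumptions: copy_kernel, gb_per_s, the buffer size, and the block counts are hypothetical, and the bandwidth figure counts one read and one write per element.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride copy: the threads together cover all n elements, so the same
// kernel works unchanged for any number of blocks and threads.
__global__ void copy_kernel(const float *in, float *out, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        out[i] = in[i];
    }
}

// Bandwidth in GB/s (1 GB = 1e9 bytes) for bytes_moved bytes in ms milliseconds.
static double gb_per_s(size_t bytes_moved, double ms) {
    return (double)bytes_moved / (ms * 1e6);
}

int main() {
    const size_t n = 1 << 24;  // 16M floats, 64 MB per buffer (illustrative)
    const int threads = 256, reps = 5;
    float *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int blocks = 1; blocks <= 4096; blocks *= 4) {
        copy_kernel<<<blocks, threads>>>(d_in, d_out, n);  // warm-up
        cudaEventRecord(start);
        for (int r = 0; r < reps; ++r) {
            copy_kernel<<<blocks, threads>>>(d_in, d_out, n);
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        size_t bytes = 2 * n * sizeof(float);  // one read and one write per element
        printf("blocks=%4d threads=%d: %7.2f GB/s\n", blocks, threads,
               gb_per_s(bytes, ms / reps));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Only the block count is swept here; repeating the loop with different values of threads gives the second axis of the experiment.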