Introduction
This is an individual assignment. In this assignment, you will implement a tiled matrix multiplication using CUDA.
1) Untar lab1.tar.gz into ~/NVIDIA_GPU_Computing_SDK/C/src
Instructions:
cd ~/NVIDIA_GPU_Computing_SDK/C/src
tar -xvf lab1.tar.gz
cd lab1
make
To run the program:
../../bin/linux/release/matrixmul
2) Edit the source files matrixmul.cu and matrixmul_kernel.cu to complete the functionality of the matrix multiplication on the device. The two input matrices can be any size, but the resulting matrix is guaranteed to have fewer than 64,000 elements.
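The tiling idea can be tried out on the CPU before writing the kernel. The sketch below is not the lab solution and does not use CUDA. It only walks the matrices tile by tile and clamps each tile at the matrix edge, which is the part that matters when sizes are not multiples of the tile width. The names tiledMatMul and TILE are hypothetical, and row-major storage is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile width, chosen to match the lab's default block size.
const int TILE = 16;

// CPU sketch of tiled multiplication P = M * N (row-major storage assumed).
// M is mh x mw, N is mw x nw, P is mh x nw.
void tiledMatMul(const std::vector<float>& M, const std::vector<float>& N,
                 std::vector<float>& P, int mh, int mw, int nw)
{
    assert(M.size() == static_cast<std::size_t>(mh) * mw);
    assert(N.size() == static_cast<std::size_t>(mw) * nw);
    assert(P.size() == static_cast<std::size_t>(mh) * nw);

    std::fill(P.begin(), P.end(), 0.0f);

    // Each (rowTile, colTile) pair plays the role of one thread block.
    for (int rowTile = 0; rowTile < mh; rowTile += TILE) {
        for (int colTile = 0; colTile < nw; colTile += TILE) {
            // Clamp the tile so partial tiles at the edges stay in range.
            const int rowEnd = std::min(rowTile + TILE, mh);
            const int colEnd = std::min(colTile + TILE, nw);

            // Walk the shared dimension one tile at a time.
            for (int kTile = 0; kTile < mw; kTile += TILE) {
                const int kEnd = std::min(kTile + TILE, mw);

                for (int r = rowTile; r < rowEnd; ++r) {
                    for (int c = colTile; c < colEnd; ++c) {
                        float sum = 0.0f;
                        for (int k = kTile; k < kEnd; ++k) {
                            sum += M[r * mw + k] * N[k * nw + c];
                        }
                        // Accumulate this tile's partial dot product.
                        P[r * nw + c] += sum;
                    }
                }
            }
        }
    }
}
```

On the device, each output tile would typically be handled by one thread block, and the clamping done here with std::min would become a per-thread range check.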
3) There are several modes of operation for the application.
No arguments: The application will create two randomly sized and initialized matrices such that the matrix operation M * N is valid, and P is properly sized to hold the result. After the device multiplication is invoked, it will compute the correct solution matrix using the CPU, and compare that solution with the device-computed solution. If it matches (within a certain tolerance), it will print out "Test PASSED" to the screen before exiting.
One argument: The application will use the random initialization to create the input matrices, and write the device-computed output to the file specified by the argument.
Three arguments: The application will read input matrices from provided files. The first argument should be a file containing three integers. The first, second and third integers will be used as M.height, M.width, and N.width (N.height must equal M.width for M * N to be valid). The second and third arguments are expected to be files which have exactly enough entries to fill matrices M and N respectively. No output is written to file.
Four arguments: The application will read its inputs from the files provided by the first three arguments as described above, and write its output to the file provided in the fourth.
Note that if you wish to use the output of one run of the application as an input, you must delete the first line in the output file, which displays the accuracy of the values within the file. The value is not relevant for this application.
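The check performed in the no-argument mode can be pictured roughly as follows. This is a sketch, not the code shipped in lab1. The function names are hypothetical, and the handout does not say how the tolerance is defined, so the absolute tolerance used here is an assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for P = M * N (row-major storage assumed).
// M is mh x mw, N is mw x nw, P is mh x nw.
void referenceMatMul(const std::vector<float>& M, const std::vector<float>& N,
                     std::vector<float>& P, int mh, int mw, int nw)
{
    for (int r = 0; r < mh; ++r) {
        for (int c = 0; c < nw; ++c) {
            double sum = 0.0;  // accumulate in double to limit rounding error
            for (int k = 0; k < mw; ++k) {
                sum += static_cast<double>(M[r * mw + k]) * N[k * nw + c];
            }
            P[r * nw + c] = static_cast<float>(sum);
        }
    }
}

// Returns true when every device value is within `tolerance` (absolute
// difference, an assumption) of the corresponding reference value.
bool matchesReference(const std::vector<float>& reference,
                      const std::vector<float>& device, float tolerance)
{
    if (reference.size() != device.size()) {
        return false;
    }
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const float diff = std::fabs(reference[i] - device[i]);
        // Written as !(diff <= tolerance) so that a NaN result also fails.
        if (!(diff <= tolerance)) {
            return false;
        }
    }
    return true;
}
```

Running a comparison like this on a few small, hand-checked matrices can help separate indexing bugs from ordinary floating-point differences.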
4) Measure the following case.
For a matrix size of 256, run with block sizes of 8 and 16 and measure the speedup.
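One possible way to organize the measurement is sketched below. The helper names are hypothetical, and the handout does not say which run is the baseline, so the choice of baseline (for example the CPU version, or the run with block size 8) is left to the caller. Kernel launches are asynchronous, so the timed work has to wait for the device to finish before it returns, otherwise only the launch overhead is measured.

```cpp
#include <cassert>
#include <chrono>

// Wall-clock time in milliseconds taken by `work`. For a kernel launch,
// `work` must block until the device has finished (an assumption about
// how the caller wraps the launch).
template <typename F>
double elapsedMs(F&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Speedup of a measured run relative to a baseline run.
// A value above 1.0 means the measured run is faster than the baseline.
double speedup(double baselineMs, double measuredMs)
{
    assert(measuredMs > 0.0);
    return baselineMs / measuredMs;
}
```

Averaging several runs per block size, and reporting both the raw times and the ratio, makes the comparison easier to interpret.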
5) Submission:
The lab1.tar.gz file should contain the lab1 folder provided, with all the changes and additions you have made to the source code. Include a PDF file with your answer to step 4.
Instructions:
cd ~/NVIDIA_GPU_Computing_SDK/C/src/lab1
make clean
cd ..
tar cvf lab1.tar lab1
gzip lab1.tar
Upload the lab1.tar.gz file to T-square.
6) Grading
We will grade the functionality of the code with different matrix sizes. We will test arbitrary block sizes.
We will have a demo for the grading.
If your code works only for matrix sizes that are multiples of 16 (the default block size), you will receive 60% of the total grade.
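A common way to cover sizes that are not multiples of the block size is to round the number of blocks up and have out-of-range threads skip their work. The sketch below shows only the host-side arithmetic. GridSize is a hypothetical stand-in for CUDA's dim3, and gridFor and inRange are illustrative names, not functions from the lab code.

```cpp
#include <cassert>

// Hypothetical stand-in for dim3: number of blocks in x and y.
struct GridSize {
    int x;
    int y;
};

// Blocks needed to cover a height x width result matrix with square
// blocks of blockSize x blockSize threads, rounding up in each dimension.
// x covers the columns (width), y covers the rows (height).
GridSize gridFor(int height, int width, int blockSize)
{
    assert(blockSize > 0);
    GridSize grid;
    grid.x = (width + blockSize - 1) / blockSize;
    grid.y = (height + blockSize - 1) / blockSize;
    return grid;
}

// True when the thread mapped to (row, col) corresponds to a real element
// of the result. Threads in the rounded-up part of the grid fail this test.
bool inRange(int row, int col, int height, int width)
{
    return row < height && col < width;
}
```

For example, a 100 x 50 result with block size 16 needs 4 blocks in x and 7 in y, and the threads beyond row 99 or column 49 in the last blocks must not read or write memory.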
(Source: UIUC ECE 498AL)