This assignment covers caching and threads. The last problem is a programming assignment that serves as a warm-up to Project 4.
Problems
Files:
Help for using pthreads: running
man -k pthread
on turku lists all the pthread man page entries. Of particular interest to you will be
man pthread_mutex_init
and
man pthread_cond_init

A. [11 points] For this question, we will be using an architecture with the following attributes:
The TLB, a partial page table, and a few virtual addresses are shown below. Use this information to answer questions i through iv.
TLB:
/------------------------------------------\
| Valid | Dirty | Page         | Frame     |
|-------|-------|--------------|-----------|
|   0   |   0   | 00 0000 0000 | 0001 0000 |
|   0   |   0   | 00 0000 0101 | 0000 1110 |
|   0   |   1   | 00 0001 0110 | 0000 0000 |
|   0   |   1   | 00 0001 1011 | 0000 0000 |
|   1   |   0   | 11 1010 0100 | 1000 0100 |
|   1   |   0   | 11 0100 0101 | 0100 0101 |
|   1   |   1   | 10 0010 1010 | 1100 0110 |
|   1   |   0   | 10 0000 1111 | 1000 1001 |
\------------------------------------------/

Page Table (partial):
/------------------------------------------\
| ADDRESS      | VALID | CONTENTS          |
|--------------|-------|-------------------|
| 00 0000 0000 |   0   | 0001 0000         |
| 00 0000 0001 |   1   | 0011 0011         |
| 00 0000 0010 |   1   | 0000 0000         |
| 00 0000 0011 |   0   | 0010 0100         |
| 00 0000 0100 |   1   | 0001 1000         |
| 00 0000 0101 |   0   | 0000 1110         |
| 00 0000 0110 |   1   | 1100 0011         |
| 00 0000 0111 |   1   | 0101 0110         |
| ....         |  ...  | ...               |
| 11 1111 1110 |   0   | 1010 0101         |
| 11 1111 1111 |   0   | 1111 1111         |
\------------------------------------------/

Virtual Addresses:
/----------------------------\
| A | 00 0000 0001 1001 1001 |
| B | 00 0000 0101 0110 0110 |
| C | 11 1111 1111 1111 1111 |
| D | 00 0000 0111 0000 1111 |
| E | 00 0000 0110 1100 0101 |
\----------------------------/
B. [12 points] Consider the following page-reference string: 3, 1, 4, 5, 3, 4, 1, 5, 3, 2, 4, 5, 2, 4, 1, 3, 2, 5
How many page faults would occur for each of the following replacement algorithms, assuming 4 frames? Remember that all frames are initially empty, so your first unique pages will each cost one fault. You should assume that this pattern repeats endlessly. (Hint: this may be important for the optimal replacement algorithm.)
Consider the sequence of data references in hw4_answers.txt, made to a cache by some program. Each reference is a read of a 4-byte integer value and is described by the byte address of that integer.
A. [10 points] Assuming a 1KB, 16B block, direct-mapped cache, initially empty, fill in whether each reference is a hit or a miss. Also, fill in the long-term hit rate as a percentage.
B. [5 points] Suppose the cache is changed to be 2-way set associative (LRU replacement) but otherwise has the same parameters. Fill in the hits and misses. What is the long-term hit rate for the 2-way set-associative cache?
C. [10 points] Set-associative caches generally have better hit rates than direct-mapped caches of the same size. However, it is possible to find counterexamples. Construct a repeating sequence of references such that the 1KB direct-mapped cache described in part A achieves a better hit rate than the 2-way set-associative cache described in part B.
A. [10 points] What is the total number of bits (overhead and data) required for this particular cache configuration: 1 MB total data, 16-way set associative, 512 Byte blocks. Assume a "write-back" write strategy and a "FIFO" replacement strategy. Assume a 32-bit, byte-addressed architecture.
EMAT = Time for a hit + (Miss rate x Miss penalty)
A. [10 points] Find the EMAT for a machine with a 1-ns clock, a miss penalty of 40 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access time (including hit detection) of 1 clock cycle. Assume that the read and write miss penalties are the same and ignore other write stalls.
B. [5 points] Suppose we can improve the miss rate to 0.03 misses per reference by doubling the cache size. This causes the cache access time to increase to 2 clock cycles. Using the EMAT as a metric, determine if this is a good trade-off. Please show your work.
C. [5 points] Generally speaking, the CPU cycle time is matched to the cache access time in a pipelined processor. Let us consider two machines that have identical instruction sets and pipeline structure. They differ only in the clock speeds of the processor and the cache structure.
Machine A:
CPU clock cycle time = 1 ns
Cache access time = 1 CPU cycle
Cache miss rate = 5%
Cache miss penalty = 60 CPU cycles
Machine B:
CPU Clock cycle time = 2 ns
Cache access time = 1 CPU cycle
Cache miss rate = 3%
Cache miss penalty = 30 CPU cycles
Both machines have a CPI of 3 without accounting for memory stalls, and both incur an average of 1.45 memory references per instruction.
(a) Which processor has a better EMAT?
(b) Is the EMAT sufficient to declare one machine to be better than the other? Why or why not?
(c) Which machine is actually better? [Hint: refer to Section 9.9.1 of the textbook.]
D. [5 points] Consider the following memory hierarchy:
- A 64-entry fully associative TLB split into two halves; one half for user processes and the other half for the kernel. The TLB has an access time of 1 cycle. The hit rate for the TLB is 95%. A miss results in a main memory access to complete the address translation.
- An L1 cache with a 1-cycle access time and a 99% hit rate.
- An L2 cache with a 5-cycle access time and a 90% hit rate.
- An L3 cache with a 20-cycle access time and an 80% hit rate.
- A physical memory with a 100-cycle access time.
Compute the effective memory access time (EMAT) for this memory hierarchy. Note that the page table entry may itself be in the cache.
This problem has you solve the classic "bounded buffer" problem with one producer and multiple consumer threads.
The program takes the number of consumers as an argument (defaulting to 1) and a sequence of numbers from stdin. We give you a couple of test sequences: shortlist and longlist. For more explanation of how this works, see the comment at the top of hw4.c.
The producer thread reads the sequence of numbers and feeds that to the
consumers. Consumers pick up a number, do some "work" with the
number, then go back for another number.
The program as provided includes output from the producer and consumers. For
reference, a working version of the code with a bounded buffer of size 10
running on shortlist with four consumers produces this
output (the comments on the right are added). (NOTE: your output may not exactly match what is shown, due to the randomness of thread scheduling; however, it should show all entries being produced in the correct order and consumed in the correct order.)
turku% ./hw4 4 < shortlist
main: nconsumers = 4
consumer 0: starting
consumer 1: starting
consumer 2: starting
consumer 3: starting
producer: starting
producer: 1
producer: 2
producer: 3
producer: 4
producer: 5
producer: 6
producer: 7
producer: 8
producer: 9
producer: 10
producer: 9
producer: 8
producer: 7
producer: 6
consumer 0: 1
producer: 5
consumer 1: 2
producer: 4
consumer 2: 3
producer: 3
consumer 3: 4
producer: 2
consumer 0: 5
producer: 1
consumer 1: 6
producer: read EOF, sending 4 '-1' numbers
consumer 2: 7
consumer 3: 8
consumer 0: 9
consumer 1: 10
producer: exiting
consumer 2: 9
consumer 3: 8
consumer 0: 7
consumer 1: 6
consumer 2: 5
consumer 3: 4
consumer 3: exiting
consumer 0: 3
consumer 0: exiting
consumer 2: 1
consumer 2: exiting
consumer 1: 2
consumer 1: exiting
A. [40 points] Finish the bounded-buffer code in hw4.c, adding synchronization so that the multiple threads can safely access the buffer simultaneously. Use
pthread_cond_wait()
to wait when the buffer is empty or full.

B. [0 points] Testing suggestions: time your runs with
/bin/time
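The synchronization pattern part A asks for can be sketched generically with one mutex and two condition variables. The names below (buf_t, buf_put, buf_get) are illustrative and are not the hw4.c skeleton's names:

```c
#include <pthread.h>

/* Generic bounded buffer guarded by one mutex and two condition
 * variables; names are illustrative, not from the hw4.c skeleton. */
#define BUFSIZE 10

typedef struct {
    int items[BUFSIZE];
    int count, in, out;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} buf_t;

void buf_put(buf_t *b, int v) {
    pthread_mutex_lock(&b->lock);
    while (b->count == BUFSIZE)                 /* wait while full */
        pthread_cond_wait(&b->not_full, &b->lock);
    b->items[b->in] = v;
    b->in = (b->in + 1) % BUFSIZE;
    b->count++;
    pthread_cond_signal(&b->not_empty);
    pthread_mutex_unlock(&b->lock);
}

int buf_get(buf_t *b) {
    pthread_mutex_lock(&b->lock);
    while (b->count == 0)                       /* wait while empty */
        pthread_cond_wait(&b->not_empty, &b->lock);
    int v = b->items[b->out];
    b->out = (b->out + 1) % BUFSIZE;
    b->count--;
    pthread_cond_signal(&b->not_full);
    pthread_mutex_unlock(&b->lock);
    return v;
}
```

Note the while (not if) around pthread_cond_wait(): the predicate must be re-checked after the lock is reacquired, since wakeups can be spurious or the state can change before the woken thread runs.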
When running with longlist, doubling the number of consumers should roughly halve the execution time. What is the minimum possible execution time?

End of CS 2200 Homework 4