This assignment covers caching and threads. The last problem is a programming assignment that serves as a warm-up to Project 4.
Problems
Files:
Help for using pthreads: running
man -k pthread
on turku lists all the pthread man page entries. Of particular interest to you will be
man pthread_mutex_init
and
man pthread_cond_init

A. [11 points] For this question, we will be using an architecture with the following attributes:
The TLB, a partial page table, and a few virtual addresses are shown below. Use this information to answer questions i through iv.
TLB:
/------------------------------------------\
| Valid | Dirty | Page         | Frame     |
|-------|-------|--------------|-----------|
|   0   |   0   | 00 0000 0000 | 0001 0000 |
|   0   |   0   | 00 0000 0101 | 0000 1110 |
|   0   |   1   | 00 0001 0110 | 0000 0000 |
|   0   |   1   | 00 0001 1011 | 0000 0000 |
|   1   |   0   | 11 1010 0100 | 1000 0100 |
|   1   |   0   | 11 0100 0101 | 0100 0101 |
|   1   |   1   | 10 0010 1010 | 1100 0110 |
|   1   |   0   | 10 0000 1111 | 1000 1001 |
\------------------------------------------/

Page Table (partial):
/------------------------------------------\
| ADDRESS      | VALID | CONTENTS          |
|--------------|-------|-------------------|
| 00 0000 0000 |   0   | 0001 0000         |
| 00 0000 0001 |   1   | 0011 0011         |
| 00 0000 0010 |   1   | 0000 0000         |
| 00 0000 0011 |   0   | 0010 0100         |
| 00 0000 0100 |   1   | 0001 1000         |
| 00 0000 0101 |   0   | 0000 1110         |
| 00 0000 0110 |   1   | 1100 0011         |
| 00 0000 0111 |   1   | 0101 0110         |
| ....         |  ...  | ...               |
| 11 1111 1110 |   0   | 1010 0101         |
| 11 1111 1111 |   0   | 1111 1111         |
\------------------------------------------/

Virtual Addresses:
/----------------------------\
| A | 00 0000 0001 1001 1001 |
| B | 00 0000 0101 0110 0110 |
| C | 11 1111 1111 1111 1111 |
| D | 00 0000 0111 0000 1111 |
| E | 00 0000 0110 1100 0101 |
\----------------------------/
B. [12 points] Consider the following page-reference string: 3, 1, 4, 5, 3, 4, 1, 5, 3, 2, 4, 5, 2, 4, 1, 3, 2, 5
How many page faults would occur for each of the following replacement algorithms, assuming 4 frames? Remember that all frames are initially empty, so your first unique pages will each cost one fault. You should assume that this pattern repeats endlessly. (Hint: this may be important for the optimal replacement algorithm.)
Consider the sequence of data references in hw4_answers.txt, made to a cache by some program. Each reference is a read of a 4-byte integer value and is described by the byte address of that integer.
A. [10 points] Assuming a 1KB, 16B block, direct-mapped cache, initially empty, fill in whether each reference is a hit or a miss. Also, fill in the long-term hit rate as a percentage.
B. [5 points] Suppose the cache is changed to be 2-way set associative (LRU replacement) but otherwise has the same parameters. Fill in the hits and misses. What is the long-term hit rate for the 2-way set-associative cache?
C. [10 points] Set-associative caches generally have better hit rates than direct-mapped caches of the same size. However, it is possible to find counterexamples. Construct a repeating sequence of references such that the 1KB direct-mapped cache described in part A achieves a better hit rate than the 2-way set-associative cache described in part B.
A. [10 points] What is the total number of bits (overhead and data) required for this particular cache configuration: 1 MB total data, 16-way set associative, 512 Byte blocks. Assume a "write-back" write strategy and a "FIFO" replacement strategy. Assume a 32-bit, byte-addressed architecture.
EMAT = Time for a hit + (Miss rate x Miss penalty)
A. [10 points] Find the EMAT for a machine with a 1-ns clock, a miss penalty of 40 clock cycles, a miss rate of 0.05 misses per instruction, and a cache access time (including hit detection) of 1 clock cycle. Assume that the read and write miss penalties are the same and ignore other write stalls.
B. [5 points] Suppose we can improve the miss rate to 0.03 misses per reference by doubling the cache size. This causes the cache access time to increase to 2 clock cycles. Using the EMAT as a metric, determine if this is a good trade-off. Please show your work.
C. [5 points] Generally speaking, the CPU cycle time is matched to the cache access time in a pipelined processor. Let us consider two machines that have identical instruction sets and pipeline structure. They differ only in the clock speeds of the processor and the cache structure.
Machine A:
CPU clock cycle time = 1 ns
Cache access time = 1 CPU cycle
Cache miss rate = 5%
Cache miss penalty = 60 CPU cycles
Machine B:
CPU Clock cycle time = 2 ns
Cache access time = 1 CPU cycle
Cache miss rate = 3%
Cache miss penalty = 30 CPU cycles
Both machines have a CPI of 3 without accounting for memory stalls, and both incur an average of 1.45 memory references per instruction.
(a) Which processor has a better EMAT?
(b) Is the EMAT sufficient to declare one machine to be better than the other? Why or why not?
(c) Which machine is actually better? [Hint: refer to Section 9.9.1 of the textbook.]
D. [5 points] Consider the following memory hierarchy:
- A 64-entry fully associative TLB split into two halves; one half for user processes and the other half for the kernel. The TLB has an access time of 1 cycle. The hit rate for the TLB is 95%. A miss results in a main memory access to complete the address translation.
- An L1 cache with a 1-cycle access time and a 99% hit rate.
- An L2 cache with a 5-cycle access time and a 90% hit rate.
- An L3 cache with a 20-cycle access time and an 80% hit rate.
- A physical memory with a 100-cycle access time.
Compute the effective memory access time (EMAT) for this memory hierarchy. Note that the page table entry may itself be in the cache.
This problem has you solve the classic "bounded buffer" problem with one producer and multiple consumer threads.
The program takes the number of consumers as an argument (defaulting to 1) and a sequence of numbers from stdin. We give you a couple of test sequences: shortlist and longlist. For more explanation of how this works, see the comment at the top of hw4.c.
The producer thread reads the sequence of numbers and feeds that to the
consumers. Consumers pick up a number, do some "work" with the
number, then go back for another number.
The program as provided includes output from the producer and consumers. For
reference, a working version of the code with a bounded buffer of size 10
running on shortlist with four consumers produces this
output (the comments on the right are added). (NOTE: your output may not exactly match what is shown, due to the randomness of thread scheduling; however, it should show all entries being produced in the correct order and consumed in the correct order.)
turku% ./hw4 4 < shortlist
main: nconsumers = 4
consumer 0: starting
consumer 1: starting
consumer 2: starting
consumer 3: starting
producer: starting
producer: 1
producer: 2
producer: 3
producer: 4
producer: 5
producer: 6
producer: 7
producer: 8
producer: 9
producer: 10
producer: 9
producer: 8
producer: 7
producer: 6
consumer 0: 1
producer: 5
consumer 1: 2
producer: 4
consumer 2: 3
producer: 3
consumer 3: 4
producer: 2
consumer 0: 5
producer: 1
consumer 1: 6
producer: read EOF, sending 4 '-1' numbers
consumer 2: 7
consumer 3: 8
consumer 0: 9
consumer 1: 10
producer: exiting
consumer 2: 9
consumer 3: 8
consumer 0: 7
consumer 1: 6
consumer 2: 5
consumer 3: 4
consumer 3: exiting
consumer 0: 3
consumer 0: exiting
consumer 2: 1
consumer 2: exiting
consumer 1: 2
consumer 1: exiting
A. [40 points] Finish the bounded-buffer code in hw4.c, adding synchronization so that the multiple threads can safely access the buffer simultaneously. Use
pthread_cond_wait()
to wait when the buffer is empty or full.

B. [0 points] Testing suggestions: time your runs with
/bin/time
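The synchronization pattern part A asks for can be sketched generically with one mutex and two condition variables. The names below (buf_t, buf_put, buf_get) are illustrative and are not the hw4.c skeleton's names:

```c
#include <pthread.h>

/* Generic bounded buffer guarded by one mutex and two condition
 * variables; names are illustrative, not from the hw4.c skeleton. */
#define BUFSIZE 10

typedef struct {
    int items[BUFSIZE];
    int count, in, out;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} buf_t;

void buf_put(buf_t *b, int v) {
    pthread_mutex_lock(&b->lock);
    while (b->count == BUFSIZE)                 /* wait while full */
        pthread_cond_wait(&b->not_full, &b->lock);
    b->items[b->in] = v;
    b->in = (b->in + 1) % BUFSIZE;
    b->count++;
    pthread_cond_signal(&b->not_empty);
    pthread_mutex_unlock(&b->lock);
}

int buf_get(buf_t *b) {
    pthread_mutex_lock(&b->lock);
    while (b->count == 0)                       /* wait while empty */
        pthread_cond_wait(&b->not_empty, &b->lock);
    int v = b->items[b->out];
    b->out = (b->out + 1) % BUFSIZE;
    b->count--;
    pthread_cond_signal(&b->not_full);
    pthread_mutex_unlock(&b->lock);
    return v;
}
```

Note the while (not if) around pthread_cond_wait(): the predicate must be re-checked after the lock is reacquired, since wakeups can be spurious or the state can change before the woken thread runs.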
When running with longlist, doubling the number of consumers should roughly halve the execution time. What is the minimum possible execution time?

End of CS 2200 Homework 4