CS304, Fall 2015
Cache Lab: Understanding Cache Memories
Assigned: Tue. 10/20/2015
Due: Tue., 11/3, 11:59PM


1 Logistics

This is an individual project. All handins are electronic. Read the code style guideline before you begin.

2 Overview

This lab will help you understand the impact that cache memories can have on the performance of your C programs.

The lab consists of two parts. In the first part you will write a small C program (about 200-300 lines) that simulates the behavior of a cache memory. In the second part, you will optimize a small matrix transpose function, with the goal of minimizing the number of cache misses.

3 Downloading the assignment

Download cachelab.tar here.
Start by copying cachelab.tar to a protected Linux directory in which you plan to do your work.

Or you can copy the file from:

    linux> cp  ~liqun/public_html/teaching/cs304_15f/labs/cachelab.tar.gz .

Then give the command

    linux> tar xvf cachelab.tar.gz

This will create a directory called cachelab that contains a number of files. You will be modifying two files: csim.c and trans.c. To compile these files, type:

    linux> make clean  
    linux> make

WARNING: Do not let the Windows WinZip program open up your .tar file (many Web browsers are set to do this automatically). Instead, save the file to your Linux directory and use the Linux tar program to extract the files. In general, for this class you should NEVER use any platform other than Linux to modify your files. Doing so can cause loss of data (and important work!).

4 Description

The lab has two parts. In Part A you will implement a cache simulator. In Part B you will write a matrix transpose function that is optimized for cache performance.

4.1 Reference Trace Files

The traces subdirectory of the handout directory contains a collection of reference trace files that we will use to evaluate the correctness of the cache simulator you write in Part A. The trace files are generated by a Linux program called valgrind. For example, typing

    linux> valgrind --log-fd=1 --tool=lackey -v --trace-mem=yes ls -l

on the command line runs the executable program “ls -l”, captures a trace of each of its memory accesses in the order they occur, and prints them on stdout.

Valgrind memory traces have the following form:

I 0400d7d4,8  
 M 0421c7f0,4  
 L 04f6b868,8  
 S 7ff0005c8,8

Each line denotes one or two memory accesses. The format of each line is

[space]operation address,size

The operation field denotes the type of memory access: “I” denotes an instruction load, “L” a data load, “S” a data store, and “M” a data modify (i.e., a data load followed by a data store). There is never a space before an “I”, and there is always a space before an “M”, “L”, or “S”. The address field specifies a 64-bit hexadecimal memory address. The size field specifies the number of bytes accessed by the operation.
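As a rough illustration of this format, one trace line can be picked apart with sscanf(). The sketch below is only one possible approach and is not the reference simulator's code; the struct trace_op type and the function names parse_trace_line() and accesses_in_line() are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one trace line (names are illustrative only). */
struct trace_op {
    char op;                     /* 'I', 'L', 'S', or 'M' */
    unsigned long long address;  /* 64-bit address, parsed from hex */
    int size;                    /* number of bytes accessed */
};

/*
 * Parse one line such as " M 0421c7f0,4" or "I 0400d7d4,8".
 * The leading space in the format string skips optional whitespace,
 * so lines with and without the leading space are both accepted.
 * Returns 1 on success, 0 if the line does not match the format.
 */
int parse_trace_line(const char *line, struct trace_op *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, " %c %llx,%d",
                  &out->op, &out->address, &out->size) == 3;
}

/* Each line denotes one or two accesses: "M" is a load followed by a store. */
int accesses_in_line(const struct trace_op *t)
{
    assert(t != NULL);
    return (t->op == 'M') ? 2 : 1;
}
```

For example, parsing " M 0421c7f0,4" fills in op 'M', address 0x0421c7f0, and size 4, and accesses_in_line() reports 2 for that line.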

4.2 Part A: Writing a Cache Simulator

In Part A you will write a cache simulator in csim.c that takes a valgrind memory trace as input, simulates the hit/miss behavior of a cache memory on this trace, and outputs the total number of hits, misses, and evictions.

We have provided you with the binary executable of a reference cache simulator, called csim-ref, that simulates the behavior of a cache with arbitrary size and associativity on a valgrind trace file. It uses the LRU (least-recently used) replacement policy when choosing which cache line to evict.

The reference simulator takes the following command-line arguments:

Usage: ./csim-ref [-hv] -s <s> -E <E> -b <b> -t <tracefile>

For example:

    linux> ./csim-ref -s 4 -E 1 -b 4 -t traces/yi.trace  
    hits:4 misses:5 evictions:3

The same example in verbose mode:

    linux> ./csim-ref -v -s 4 -E 1 -b 4 -t traces/yi.trace  
    L 10,1 miss  
    M 20,1 miss hit  
    L 22,1 hit  
    S 18,1 hit  
    L 110,1 miss eviction  
    L 210,1 miss eviction  
    M 12,1 miss eviction hit  
    hits:4 misses:5 evictions:3

Your job for Part A is to fill in the csim.c file so that it takes the same command line arguments and produces the identical output as the reference simulator. Notice that this file is almost completely empty. You’ll need to write it from scratch.

Program outline and useful C functions:

---------------------------
Define your data structure.

Read the command line arguments.

Read the trace file.

Find (tag, set index, block offset).
Use set index to find the right set.
Within the set, compare tag to see if there is a cache line (a cache line is a cache block) matching the tag. Check the valid bit. Determine whether it is cache hit or miss.

IF Cache miss:
              replace an entry using LRU algorithm

Update LRU information.

Return

****
After you finish a step, I suggest you test that step in your program. This way, you can be sure that what you have so far is correct. If you wait until the end, it will be very hard to debug. For example, after you finish reading the command line arguments, test that part. Then move on to reading the trace file, and test that component to make sure it is correct. Then go to the next step.
---------------------------
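The "Find (tag, set index, block offset)" step in the outline above comes down to a few shifts and masks. The following is a minimal sketch, assuming s set-index bits and b block-offset bits as given by the -s and -b arguments; struct addr_parts and split_address() are hypothetical names introduced for this example.

```c
#include <assert.h>

/* Hypothetical container for the three fields of an address. */
struct addr_parts {
    unsigned long long tag;
    unsigned long long set_index;
    unsigned long long block_offset;
};

/*
 * Split a 64-bit address, given s set-index bits and b block-offset bits.
 * Layout from high bits to low bits:
 *     [ tag | set index (s bits) | block offset (b bits) ]
 */
struct addr_parts split_address(unsigned long long addr, int s, int b)
{
    struct addr_parts p;

    assert(s >= 0 && b >= 0 && s + b < 64);  /* keeps every shift well defined */

    p.block_offset = addr & ((1ULL << b) - 1);
    p.set_index    = (addr >> b) & ((1ULL << s) - 1);
    p.tag          = addr >> (s + b);
    return p;
}
```

With s = 4 and b = 4, addresses 0x10 and 0x110 both map to set 1 but carry tags 0 and 1. This is consistent with the "miss eviction" reported for "L 110,1" in the verbose yi.trace example earlier in this section.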

0. We suggest including the following header files:
#include <stdio.h>
#include <getopt.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

1. Accessing command line arguments:

Here is a nice introduction about how to do it:
http://linuxprograms.wordpress.com/2012/06/22/c-getopt-example/

If you do use getopt(), you should add the following line at the beginning of your program.
#include <getopt.h>

If you use atoi(), which converts a string to an integer, you should add the following line:
#include <stdlib.h>

You may also need the string copy function strcpy():
#include <string.h>
char *strcpy(char *dest, const char *src);
strcpy(dest, src) copies the string at src, including its terminating null byte, to dest. The buffer at dest must be large enough to hold it.

See string functions: 
http://icecube.wisc.edu/~dglo/c_class/strmemfunc.html
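To make the getopt() usage concrete, here is a minimal sketch of parsing the options listed in the csim-ref usage line (-h, -v, -s, -E, -b, -t). The struct sim_options type and the parse_options() function are hypothetical names for this example, and error handling is kept deliberately small.

```c
#include <assert.h>
#include <getopt.h>
#include <stdlib.h>

/* Hypothetical holder for the parsed options. */
struct sim_options {
    int help;               /* -h */
    int verbose;            /* -v */
    int s;                  /* -s <s> */
    int E;                  /* -E <E> */
    int b;                  /* -b <b> */
    const char *tracefile;  /* -t <tracefile> */
};

/* Returns 0 on success, -1 on an unknown option or a missing argument. */
int parse_options(int argc, char *argv[], struct sim_options *opt)
{
    int c;

    assert(argv != NULL && opt != NULL);
    opt->help = opt->verbose = 0;
    opt->s = opt->E = opt->b = 0;
    opt->tracefile = NULL;

    /* A colon after a letter means the option takes an argument (in optarg). */
    while ((c = getopt(argc, argv, "hvs:E:b:t:")) != -1) {
        switch (c) {
        case 'h': opt->help = 1;           break;
        case 'v': opt->verbose = 1;        break;
        case 's': opt->s = atoi(optarg);   break;
        case 'E': opt->E = atoi(optarg);   break;
        case 'b': opt->b = atoi(optarg);   break;
        case 't': opt->tracefile = optarg; break;
        default:  return -1;               /* getopt() returned '?' */
        }
    }
    return 0;
}
```

A main() function would typically call parse_options(argc, argv, &opt) first and print a usage message if it returns -1 or if the tracefile is still NULL.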


2. malloc and free:


Suppose you want to allocate memory for your cache. You may first declare a struct cache_line. Each cache line should include the following information: a tag, a valid bit, and an LRU counter.

The LRU counter can be updated every time you have a memory access. Use it as the time of the most recent access to each cache line.

You can use the following to allocate a 2d memory layout:

    struct cache_line** cache = (struct cache_line**) malloc(sizeof(struct cache_line*) * S);

    for (i=0; i<S; i++){
        cache[i]=(struct cache_line*) malloc(sizeof(struct cache_line) * E);
        for (j=0; j<E; j++){
            cache[i][j].valid = 0;
            cache[i][j].tag = 0;
            cache[i][j].lru = 0;
        }
    }

You can also allocate space for all cache lines in a single block and index it as a flat array, where line j of set i is at index i*E+j:

    struct cache_line* cache = (struct cache_line *) malloc(sizeof(struct cache_line)*S*E);

    for (i=0; i<S; i++){
        for (j=0; j<E; j++){
            cache[i*E+j].valid = 0;
            cache[i*E+j].tag = 0;
            cache[i*E+j].lru = 0;
        }
    }

You need to free the memory at the end by calling free(). All memory allocated through malloc() should be freed. To use malloc() and free(), you should add the following line at the beginning of your program: #include <stdlib.h>
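For the 2d layout, every malloc() needs a matching free(): one per set, then one for the array of set pointers. The sketch below pairs an allocation routine with its cleanup; cache_create() and cache_destroy() are hypothetical names, the field types are assumptions, and calloc() is used here in place of malloc() plus an explicit zeroing loop.

```c
#include <assert.h>
#include <stdlib.h>

struct cache_line {
    int valid;
    unsigned long long tag;
    unsigned long long lru;
};

/* Allocate S sets of E zero-initialized lines. Returns NULL on failure. */
struct cache_line **cache_create(int S, int E)
{
    struct cache_line **cache;
    int i;

    assert(S > 0 && E > 0);

    cache = malloc(sizeof(*cache) * S);
    if (cache == NULL)
        return NULL;

    for (i = 0; i < S; i++) {
        cache[i] = calloc(E, sizeof(**cache));  /* valid, tag, lru all 0 */
        if (cache[i] == NULL) {
            while (i-- > 0)                     /* undo the earlier rows */
                free(cache[i]);
            free(cache);
            return NULL;
        }
    }
    return cache;
}

/* Release everything cache_create() allocated. */
void cache_destroy(struct cache_line **cache, int S)
{
    int i;

    if (cache == NULL)
        return;
    for (i = 0; i < S; i++)
        free(cache[i]);   /* one free() per set */
    free(cache);          /* then the array of set pointers */
}
```

With the flat layout, a single free(cache) is enough, because there was only one malloc() call.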


3. Reading from a file: You can use the following to read a trace file. See here for more on file operations: http://icecube.wisc.edu/~dglo/c_class/stdio.html

/*
Read a file with filename in string at address file_name.
Requires <stdio.h>, <stdlib.h>, <string.h>, and <errno.h>.
*/
void readFile(char* file_name)
{
    char buf[1000];
    FILE* fp = fopen(file_name, "r");

    /*File opened correctly?*/
    if(!fp){
        fprintf(stderr, "%s: %s\n", file_name, strerror(errno));
        exit(1);
    }

    /*read a line from the file*/
    while( fgets(buf, sizeof(buf), fp) != NULL) {

        /*
        Here you may use sscanf():

            int sscanf(const char *str, const char *format, ...);

        sscanf() is similar to scanf(), except that it reads from the
        string str instead of from standard input.

        In the format string, %c reads a character, %x a hexadecimal
        int, %s a string, and %d a decimal int. For example, "%x,%d"
        reads a hexadecimal int, a comma, and a decimal int.

        Pass a pointer to each variable (e.g. &size), not the variable
        itself.

        Trace addresses can be wider than 32 bits, so declare the
        address as unsigned long long int, which is 64 bits on x86-64,
        and read it with %llx:

            char op;
            unsigned long long int address;
            int size;
            sscanf(buf, " %c %llx,%d", &op, &address, &size);
        */

    }

    /*close the file*/
    fclose(fp);
}

4. Format for printf()

You can use gdb to debug your program. Try the GDB reference card or tutorial 1, tutorial 2, or tutorial 3.

If you use printf(), the following format will print out the pointer p, i in hexadecimal form, j in decimal form, c as a character, and the string str.

int  i=1, j=2, *p=&i;
char c='A', str[10]="abcde";
printf("%p %x %d %c %s\n", (void *)p, i, j, c, str);


Programming Rules for Part A

4.3 Part B: Optimizing Matrix Transpose

In Part B you will write a transpose function in trans.c that causes as few cache misses as possible.

Let A denote a matrix, and A_ij denote the component in the ith row and jth column. The transpose of A, denoted A^T, is a matrix such that A_ij = (A^T)_ji.

To help you get started, we have given you an example transpose function in trans.c that computes the transpose of N × M matrix A and stores the results in M × N matrix B:

    char trans_desc[] = "Simple row-wise scan transpose";  
    void trans(int M, int N, int A[N][M], int B[M][N])

The example transpose function is correct, but it is inefficient because the access pattern results in relatively many cache misses.
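The body of the example function is not reproduced in this handout. A row-wise scan with the same signature might look like the following sketch; the name trans_rowwise_sketch is hypothetical, and the code in your trans.c may differ in detail. It reads A one row at a time, so the writes to B walk down a column: consecutive writes land N ints apart in memory and, for large N, fall in different cache blocks.

```c
#include <assert.h>

/* Transpose the N x M matrix A into the M x N matrix B, scanning A by rows. */
void trans_rowwise_sketch(int M, int N, int A[N][M], int B[M][N])
{
    int i, j;

    assert(M > 0 && N > 0);

    for (i = 0; i < N; i++) {        /* each row of A ... */
        for (j = 0; j < M; j++) {    /* ... is read left to right */
            B[j][i] = A[i][j];       /* writes walk down a column of B */
        }
    }
}
```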

Your job in Part B is to write a similar function, called transpose_submit, that minimizes the number of cache misses across different sized matrices:

    char transpose_submit_desc[] = "Transpose submission";  
    void transpose_submit(int M, int N, int A[N][M], int B[M][N]);

Do not change the description string (“Transpose submission”) for your transpose_submit function. The autograder searches for this string to determine which transpose function to evaluate for credit.

Cache Blocking

In the simple transpose, we constantly access new values from memory and get very little reuse of cached data. We can improve the amount of data reuse in the caches by implementing a technique called cache blocking. More formally, cache blocking is a technique that attempts to reduce the cache miss rate by improving the temporal and/or spatial locality of memory accesses. In the case of matrix transposition we consider performing the transposition one block at a time.




In the above image, each block Aij of matrix A is transposed into its final location in the output matrix. With this scheme, we significantly reduce the size of the working set in cache at any one point in time. This (if implemented correctly) will result in a substantial improvement in performance. For this lab, you will implement a cache blocking scheme for matrix transposition.
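To make the idea concrete, here is a generic blocked transpose. It is a sketch of the technique only, not a tuned solution: the function name trans_blocked_sketch and the block edge BLOCK are assumptions made for illustration, and a block size that works well for one matrix size may work poorly for another.

```c
#include <assert.h>

/* Illustrative block edge length; not tuned for any matrix size or cache. */
#define BLOCK 8

/* Transpose the N x M matrix A into the M x N matrix B, one block at a time. */
void trans_blocked_sketch(int M, int N, int A[N][M], int B[M][N])
{
    int ii, jj, i, j;

    assert(M > 0 && N > 0);

    for (ii = 0; ii < N; ii += BLOCK) {          /* block row of A */
        for (jj = 0; jj < M; jj += BLOCK) {      /* block column of A */
            /* Transpose one block; the extra bounds checks handle
               matrices whose sides are not multiples of BLOCK. */
            for (i = ii; i < ii + BLOCK && i < N; i++) {
                for (j = jj; j < jj + BLOCK && j < M; j++) {
                    B[j][i] = A[i][j];
                }
            }
        }
    }
}
```

The two outer loops pick a block and the two inner loops stay inside it, so the rows of A and B touched at any one time are limited to the current block.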

Programming Rules for Part B

5 Evaluation

This section describes how your work will be evaluated. The full score for this lab is 60 points:

5.1 Evaluation for Part A

For Part A, we will run your cache simulator using different cache parameters and traces. There are eight test cases, each worth 3 points, except for the last case, which is worth 6 points:

  linux> ./csim -s 1 -E 1 -b 1 -t traces/yi2.trace  
  linux> ./csim -s 4 -E 2 -b 4 -t traces/yi.trace  
  linux> ./csim -s 2 -E 1 -b 4 -t traces/dave.trace  
  linux> ./csim -s 2 -E 1 -b 3 -t traces/trans.trace  
  linux> ./csim -s 2 -E 2 -b 3 -t traces/trans.trace  
  linux> ./csim -s 2 -E 4 -b 3 -t traces/trans.trace  
  linux> ./csim -s 5 -E 1 -b 5 -t traces/trans.trace  
  linux> ./csim -s 5 -E 1 -b 5 -t traces/long.trace

You can use the reference simulator csim-ref to obtain the correct answer for each of these test cases. During debugging, use the -v option for a detailed record of each hit and miss.

For each test case, outputting the correct number of cache hits, misses and evictions will give you full credit for that test case. Each of your reported number of hits, misses and evictions is worth 1/3 of the credit for that test case. That is, if a particular test case is worth 3 points, and your simulator outputs the correct number of hits and misses, but reports the wrong number of evictions, then you will earn 2 points.
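The per-case arithmetic can be written out as a small function. This is only an illustration of the rule above, not the autograder's code; test_case_points() is a hypothetical name.

```c
#include <assert.h>

/*
 * Points earned on one Part A test case. Each of the three counts
 * (hits, misses, evictions) is worth one third of the case's points.
 * case_points is assumed to be a multiple of 3 (3 or 6 in this lab).
 */
int test_case_points(int case_points, int hits_ok, int misses_ok, int evictions_ok)
{
    int correct = (hits_ok != 0) + (misses_ok != 0) + (evictions_ok != 0);

    assert(case_points >= 0 && case_points % 3 == 0);
    return (case_points / 3) * correct;
}
```

For the example in the text, test_case_points(3, 1, 1, 0) evaluates to 2.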

5.2 Evaluation for Part B

For Part B, we will evaluate the correctness and performance of your transpose_submit function on three different-sized output matrices:

5.2.1 Performance (26 pts)

For each matrix size, the performance of your transpose_submit function is evaluated by using valgrind to extract the address trace for your function, and then using the reference simulator to replay this trace on a cache with parameters (s = 5, E = 1, b = 5).
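It helps to translate these parameters into sizes. With s set-index bits there are S = 2^s sets, and with b block-offset bits each line holds B = 2^b bytes. For (s = 5, E = 1, b = 5) this gives 32 sets of one 32-byte line each, i.e. a 1 KB direct-mapped cache. Assuming 4-byte ints (typical on x86-64 Linux), each line holds 8 matrix elements. The helper names below are hypothetical.

```c
#include <assert.h>

/* Number of sets, S = 2^s, where s is the number of set-index bits. */
unsigned long cache_num_sets(int s)
{
    assert(s >= 0 && s < 31);
    return 1UL << s;
}

/* Block size in bytes, B = 2^b, where b is the number of block-offset bits. */
unsigned long cache_block_bytes(int b)
{
    assert(b >= 0 && b < 31);
    return 1UL << b;
}

/* Total data capacity in bytes: S sets * E lines per set * B bytes per line. */
unsigned long cache_capacity_bytes(int s, int E, int b)
{
    assert(E > 0);
    return cache_num_sets(s) * (unsigned long)E * cache_block_bytes(b);
}
```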

Your performance score for each matrix size scales linearly with the number of misses, m, up to some threshold:

Your code must be correct to receive any performance points for a particular size. Your code only needs to be correct for these three cases and you can optimize it specifically for these three cases. In particular, it is perfectly OK for your function to explicitly check for the input sizes and implement separate code optimized for each case.

5.3 Evaluation for Style

There are 7 points for coding style. These will be assigned manually by the course staff. Style guidelines can be found on the code style guideline page.

The course staff will inspect your code in Part B for illegal arrays and excessive local variables.

6 Working on the Lab

6.1 Working on Part A

We have provided you with an autograding program, called test-csim, that tests the correctness of your cache simulator on the reference traces. Be sure to compile your simulator before running the test:

linux> make  
linux> ./test-csim  
                        Your simulator     Reference simulator  
Points (s,E,b)    Hits  Misses  Evicts    Hits  Misses  Evicts  
     3 (1,1,1)       9       8       6       9       8       6  traces/yi2.trace  
     3 (4,2,4)       4       5       2       4       5       2  traces/yi.trace  
     3 (2,1,4)       2       3       1       2       3       1  traces/dave.trace  
     3 (2,1,3)     167      71      67     167      71      67  traces/trans.trace  
     3 (2,2,3)     201      37      29     201      37      29  traces/trans.trace  
     3 (2,4,3)     212      26      10     212      26      10  traces/trans.trace  
     3 (5,1,5)     231       7       0     231       7       0  traces/trans.trace  
     6 (5,1,5)  265189   21775   21743  265189   21775   21743  traces/long.trace  
    27

For each test, it shows the number of points you earned, the cache parameters, the input trace file, and a comparison of the results from your simulator and the reference simulator.

Here are some hints and suggestions for working on Part A:

6.2 Working on Part B

We have provided you with an autograding program, called test-trans.c, that tests the correctness and performance of each of the transpose functions that you have registered with the autograder.

You can register up to 100 versions of the transpose function in your trans.c file. Each transpose version has the following form:

    /* Header comment */  
    char trans_simple_desc[] = "A simple transpose";  
    void trans_simple(int M, int N, int A[N][M], int B[M][N])  
    {  
        /* your transpose code here */  
    }

Register a particular transpose function with the autograder by making a call of the form:

    registerTransFunction(trans_simple, trans_simple_desc);

in the registerFunctions routine in trans.c. At runtime, the autograder will evaluate each registered transpose function and print the results. Of course, one of the registered functions must be the transpose_submit function that you are submitting for credit:

    registerTransFunction(transpose_submit, transpose_submit_desc);

See the default trans.c function for an example of how this works.

The autograder takes the matrix size as input. It uses valgrind to generate a trace of each registered transpose function. It then evaluates each trace by running the reference simulator on a cache with parameters (s = 5, E = 1, b = 5).

For example, to test your registered transpose functions on a 32 × 32 matrix, rebuild test-trans, and then run it with the appropriate values for M and N:

linux> make  
linux> ./test-trans -M 32 -N 32  
Step 1: Evaluating registered transpose funcs for correctness:  
func 0 (Transpose submission): correctness: 1  
func 1 (Simple row-wise scan transpose): correctness: 1  
func 2 (column-wise scan transpose): correctness: 1  
func 3 (using a zig-zag access pattern): correctness: 1  
 
Step 2: Generating memory traces for registered transpose funcs.  
 
Step 3: Evaluating performance of registered transpose funcs (s=5, E=1, b=5)  
func 0 (Transpose submission): hits:1766, misses:287, evictions:255  
func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151  
func 2 (column-wise scan transpose): hits:870, misses:1183, evictions:1151  
func 3 (using a zig-zag access pattern): hits:1076, misses:977, evictions:945  
 
Summary for official submission (func 0): correctness=1 misses=287

In this example, we have registered four different transpose functions in trans.c. The test-trans program tests each of the registered functions, displays the results for each, and extracts the results for the official submission.

Here are some hints and suggestions for working on Part B.

6.3 Putting it all Together

We have provided you with a driver program, called ./driver.py, that performs a complete evaluation of your simulator and transpose code. This is the same program your instructor uses to evaluate your handins. The driver uses test-csim to evaluate your simulator, and it uses test-trans to evaluate your submitted transpose function on the three matrix sizes. Then it prints a summary of your results and the points you have earned.

To run the driver, type:

    linux> ./driver.py

7 Handing in Your Work

Each time you type make in the cachelab directory, the Makefile creates a tarball, called userid-handin.tar, that contains your current csim.c and trans.c files.

To submit your work, read the README file, or run:

    linux> make submit

or use:

    linux> ~fluo/bin/submit cs304 lab4 handin.tar

Before you submit your project, recompile your code:

    linux> make

so that your handin.tar is up to date.

IMPORTANT: Do not create the handin tarball on a Windows or Mac machine, and do not hand in files in any other archive format, such as .zip, .gzip, or .tgz files.

1. The reason for this restriction is that our testing code is not able to count references to the stack. We want you to limit your references to the stack and focus on the access patterns of the source and destination arrays.

2. Because valgrind introduces many stack accesses that have nothing to do with your code, we have filtered out all stack accesses from the trace. This is why we have banned local arrays and placed limits on the number of local variables.