
I want to study the effect of the cache size on code. For programs operating on large arrays, there can be a significant speed-up if the array fits in the cache.

How can I measure this?

I tried to run this C program:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define L1_CACHE_SIZE       32      //  Kbytes  8192 integers
#define L2_CACHE_SIZE       256     //  Kbytes  65536 integers
#define L3_CACHE_SIZE       4096    //  Kbytes

#define ARRAYSIZE 32000
#define ITERATIONS 250

int arr[ARRAYSIZE];

/*************** TIME MEASUREMENTS ***************/

double microsecs() {
    struct timeval t;
    if (gettimeofday(&t, NULL) < 0)
        return 0.0;
    return (t.tv_usec + t.tv_sec * 1000000.0);
}

void init_array() {
    int i;
    for (i = 0; i < ARRAYSIZE; i++) {
        arr[i] = (rand() % 100);
    }
}

int operation() {
    int i, j;
    int sum = 0;
    for (j = 0; j < ITERATIONS; j++) {
        for (i = 0; i < ARRAYSIZE; i++) {
            sum =+ arr[i];
        }
    }

    return sum;
}

int main(void) {

    init_array();

    double t1 = microsecs();
    int result = operation();
    double t2 = microsecs();

    double t = t2 - t1;

    printf("CPU time %f milliseconds\n", t / 1000);
    printf("Result: %d\n", result);
    return 0;
}

taking different values of ARRAYSIZE and ITERATIONS (keeping their product, and hence the number of instructions, constant) in order to check whether the program runs faster when the array fits in the cache, but I always get the same CPU time.

Can anyone say what I am doing wrong?

  • Did you check whether it actually does anything? Commented Jul 7, 2016 at 17:26

3 Answers


What you really want to do is build a "memory mountain." A memory mountain helps you visualize how memory accesses affect program performance. Specifically, it measures read throughput vs spatial locality and temporal locality. Good spatial locality means that consecutive memory accesses are near each other and good temporal locality means that a certain memory location is accessed multiple times in a short amount of program time. Here is a link that briefly mentions cache performance and memory mountains. The 3rd edition of the textbook mentioned in that link is a very good reference, specifically chapter 6, for learning about memory and cache performance. (In fact, I'm currently using that section as a reference as I answer this question.)

Another link shows a test function that you could use to measure cache performance, which I have copied here:

/* data[] is assumed to be a global int array of at least elems elements */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;
    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;   /* keep the compiler from optimizing the loop away */
}

The stride controls spatial locality: it determines how far apart consecutive memory accesses are (the array size, by contrast, determines temporal locality). The idea is that timing this function estimates the number of cycles it takes to run. To get read throughput, compute (size / stride) / (cycles / MHz), where size is the size of the array in bytes, cycles is the measured cycle count, and MHz is the clock speed of your processor. You'd want to call the function once before taking any measurements to "warm up" your cache. Then, run the loop and take measurements.

I found a GitHub repository that you could use to build a 3D memory mountain on your own machine. I encourage you to try it on multiple machines with different processors and compare differences.




There's a typo in your code: =+ instead of +=.


In very old C, =+ was the original spelling of += [deprecated because of the ambiguity between =+ and = +]. A modern compiler, however, parses sum =+ arr[i] as sum = +arr[i], i.e. plain assignment, so the computed result is wrong. The loop still touches every element, though, so this typo is not the source of the timing problem.

The arr array is linked into the BSS [uninitialized data] section. The default value of variables in this section is zero, and all pages in this section are initially mapped read-only to a single shared zero page. (This is Linux/Unix-centric, but it probably applies to most modern OSes.)

So, regardless of the array size, you're only fetching from a single page, which will get cached, so that's why you get the same results.

You'll need to break the "zero page mapping" by writing something to all of arr before doing your tests. That is, do something like memset first. This will cause the OS to create a linear page mapping for arr using its COW (copy-on-write) mechanism.

