
I have just read a blog post here and am trying to do something similar. Here is my code to check what happens in examples 1 and 2:

void doSomething(long numLoop, int cacheSize){
    long k;
    int arr[1000000];   /* ~4 MB of ints, on the stack */
    for (k = 0; k < numLoop; k++) {
        int i;
        /* touch every cacheSize-th element */
        for (i = 0; i < 1000000; i += cacheSize)
            arr[i] = arr[i];
    }
}

As stated in the blog post, the execution times for doSomething(1000,2) and doSomething(1000,1) should be almost the same, but I got 2.1s and 4.3s respectively. Can anyone help me explain this? Thank you.
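(For reference, I time each call with a wrapper along these lines; the harness is illustrative, not my exact code:)

#include <stdio.h>
#include <time.h>

int main(void){
    /* illustrative timing wrapper around one call */
    clock_t start = clock();
    doSomething(1000, 2);
    printf("%.2fs\n", (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}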

Update 1: I have increased my array to 100 times its original size:

#include <stdlib.h>

void doSomething(long numLoop, int cacheSize){
    long k;
    int *buffer = malloc(100000000 * sizeof(int));  /* ~400 MB */
    for (k = 0; k < numLoop; k++) {
        int i;
        for (i = 0; i < 100000000; i += cacheSize)
            buffer[i] = buffer[i];
    }
    free(buffer);
}

Unfortunately, the execution times of doSomething(10,2) and doSomething(10,1) are still very different: 3.02s and 5.65s. Can anyone test this on your machine?

  • Are you using the exact same processor? Different processors have different cache sizes. Commented Sep 27, 2012 at 0:53
  • Try making the array bigger and see what happens. In the link he is using 64MB while you are using 1MB, so you may never get a direct memory access. Commented Sep 27, 2012 at 0:57
  • Hmm, you are correct, but I am expecting to see the difference between the L1 and the L2/L3 caches. Let me check with a larger array. Commented Sep 27, 2012 at 1:18
  • Use valgrind --tool=cachegrind ./your_program and check the Dr and Dw counts; then you will know whether the CPU cache is the main reason for your difference. Commented Sep 27, 2012 at 1:50
  • @navie, see this to learn more about caches: stackoverflow.com/questions/11413855/… Commented Sep 27, 2012 at 2:41

2 Answers


Your array, at 4 MB, is not big enough. The entire array fits in the cache (and is in the cache after the first k loop), so the timing is dominated by instruction execution, not memory access. If you make arr much bigger than the cache size you will start to see the expected effect.

(You will see an additional effect when you make arr bigger than the cache: runtime increases linearly with the size of arr until the working set exceeds the cache, at which point you hit a knee in performance and runtime jumps onto a new, steeper linear scale.)
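If you want to see that knee directly, a size sweep along these lines should show it (a rough sketch, not code from this answer's tests; the size range and the fixed touch count are arbitrary choices):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sweep the working set from 64 KB to 64 MB, doing the same number of
   element touches at every size; runtime should jump once the array
   stops fitting in the last-level cache. */
int main(void){
    size_t n;
    for (n = 1 << 14; n <= 1 << 24; n <<= 1) {
        volatile int *buf = malloc(n * sizeof(int));
        long touched = 0;
        clock_t start = clock();
        while (touched < 100000000) {
            size_t i;
            for (i = 0; i < n && touched < 100000000; i++, touched++)
                buf[i] = buf[i];
        }
        printf("%8zu ints: %.2fs\n", n,
               (double)(clock() - start) / CLOCKS_PER_SEC);
        free((void*)buf);
    }
    return 0;
}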

Edit: I tried your second version with the following changes:

  1. Change to volatile int *buffer to ensure buffer[i] = buffer[i] is not optimized away.
  2. Compile with -O2 to ensure the loop is optimized sufficiently to prevent loop overhead from dominating. (The resulting test program is sketched below.)
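Put together, the version I timed looks roughly like this (the main wrapper that reads the stride from the command line is a reconstruction, not the verbatim test program; compiled with gcc -O2):

#include <stdio.h>
#include <stdlib.h>

void doSomething(long numLoop, int cacheSize){
    long k;
    /* volatile forces the loads and stores to actually happen */
    volatile int *buffer = malloc(100000000 * sizeof(int));
    for (k = 0; k < numLoop; k++) {
        int i;
        for (i = 0; i < 100000000; i += cacheSize)
            buffer[i] = buffer[i];
    }
    free((void*)buffer);
}

int main(int argc, char **argv){
    /* stride comes from the command line: ./dos 2, ./dos 1, ... */
    doSomething(10, argc > 1 ? atoi(argv[1]) : 1);
    return 0;
}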

When I try that I get almost identical times:

kronos /tmp $ time ./dos 2
./dos 2  1.65s user 0.29s system 99% cpu 1.947 total
kronos /tmp $ time ./dos 1
./dos 1  1.68s user 0.25s system 99% cpu 1.926 total

Here you can see the effect of making the stride two full cachelines (with 4-byte ints and 64-byte lines, a stride of 16 covers exactly one cacheline, while a stride of 32 touches only every other line, halving the memory traffic):

kronos /tmp $ time ./dos 16
./dos 16  1.65s user 0.28s system 99% cpu 1.926 total
kronos /tmp $ time ./dos 32
./dos 32  1.06s user 0.30s system 99% cpu 1.356 total



With doSomething(1000, 2) you run the inner loop half as many times as with doSomething(1000, 1).

Your inner loop increments by cacheSize, so a value of 2 gives you only half the iterations of a value of 1 (500,000 element touches per pass instead of 1,000,000 in your first version). That array is also only about 4 MB, small enough to fit in a typical last-level cache.

Actually, I am a bit surprised that a good compiler does not simply optimize this loop away, since buffer[i] = buffer[i] has no observable effect; that might be part of the issue (declaring the buffer volatile, as in the other answer, prevents it).

