
I am using Visual Studio 2012, building in x64 Release mode. The following code takes 33.5% of the time my program needs to run; I measured this with the Visual Studio profiler.

    //every variable is unsigned int or unsigned int*

    for(unsigned int i = 0; i < num; i++)
    {
        unique[ids[i]]++;//2.1%
        total[ids[i]] += main_list[id];//31.4%
    }

Can someone please recommend a way to decrease the time this function takes to run?

edit: Based on your input I tried the following code:

    const unsigned int now = main_list[id];

    for(unsigned int i = ids[0], j = 0; j < num; j++)
    {
        ++unique[i];//2.0%
        total[i] += now;//16.7%
        i = ids[j];//16.8%
    }

This confirms the theory that probably CPU branch prediction fails because the positions are random (by the way, they are not completely random; they are sorted). Any idea how I could speed up my code, please?

2nd edit: I tried the following:

    const unsigned int now = main_list[id];

    for(unsigned int i = ids[0], j = 0; j < num; j++)
    {
        total[i] += now;//2.0%
        ++unique[i];//16.7%
        i = ids[j];//16.8%
    }

The above test should make it clear what is going on.

  • What are you trying to solve? It could be you are not using the correct algorithm. Just looking at this is not enough information. Commented Sep 2, 2015 at 19:59
  • @NathanOliver: I will make another post for the algorithm, but right now this is slowing down my app. Commented Sep 2, 2015 at 19:59
  • unique[ids[i]] and total[ids[i]] are probably killing a lot of your performance because indirect array references limit the ability of the compiler to do software prefetching. Also, if you could iterate directly through total and unique you could easily use an OpenMP for loop with SSE intrinsics to do your increment and add/save operations 4 at a time per thread. Commented Sep 2, 2015 at 20:15
  • @We'reAllMadHere it sounds like the ids must be sparsely distributed then. You may well be better off using a hash table rather than an array here - even if your ~100 million num elements all have unique ids you still only use ~2.5% of your unique and total array slots. If the ids are spread out randomly across the range, you'll be accessing a handful of bytes per memory page. Commented Sep 2, 2015 at 20:49
  • "... This confirms the theory that probably CPU branch prediction fails ..." Uh, wat? Nobody had that theory, and there are no branches in this code that could fail prediction. Commented Sep 2, 2015 at 20:56

3 Answers


You're not getting any locality friendliness with your code. I'd throw out two possible ideas.

  1. Group unique and total together.

    struct Stuff {
        unsigned int unique, total;
    };
    
    for(unsigned int i = 0; i < num; i++)
    {
        Stuff& s = stuffs[ids[i]];
        s.unique++;
        s.total += main_list[id]; // <== is this supposed to be ids[i]?
    }
    

That will make sure the things you're accessing consecutively are actually next to each other in memory. As-is, assuming num is sufficiently large, you're taking a cache miss on just about every access. That is about as bad as you can get.

  2. Sort ids. Right now, you're still bouncing around in memory. Let's make sure we can actually go sequentially:

    std::sort(ids, ids + num);
    // rest of loop as before
    

This way, it's likely that stuffs[ids[i+1]] will be prefetched while you're processing stuffs[ids[i]]. That'll save you a lot of lookup time as well.
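
Combining both ideas, a minimal sketch of what the loop could end up looking like (treat the names as placeholders: it assumes the Stuff struct and stuffs array from above, and that the main_list[id] value has been hoisted out into a plain value parameter):

    #include <algorithm>

    struct Stuff {
        unsigned int unique, total;
    };

    // Sketch only: sort the ids first, then walk them so consecutive
    // iterations touch neighbouring (or identical) Stuff entries.
    void accumulate(Stuff* stuffs, unsigned int* ids, unsigned int num,
                    unsigned int value /* = main_list[id], hoisted out of the loop */)
    {
        std::sort(ids, ids + num);

        for (unsigned int i = 0; i < num; i++)
        {
            Stuff& s = stuffs[ids[i]];
            s.unique++;
            s.total += value;
        }
    }

If the original order of ids matters elsewhere, sort a copy (or sort index/value pairs) instead of the array itself.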


5 Comments

Basically what you should be trying to do is keep objects that are used 'together' close together in memory as well.
@Tive Yes that is literally what I said.
@Barry: I'm wording your point differently, to maybe make it clearer what you mean. OTOH, I too, forgot to mention "locality of reference", which has better 'google-ability'....
Loop unrolling is another available technique. For example, double the statements inside the loop and increment the loop counter by 2. (Remember to use i+1 in the second set).
I don't think that sorting ids[] will help the prefetcher. A prefetcher sees p,p+4,p+8 and predicts p+12 or p+2,p+4,p+6 to predict p+8. As ids is non-contiguous, you're unlikely to have 4 id values in perfect order.

You may be getting hit by aliasing: the compiler can't fully optimize your loop because it has to allow for the possibility that unique, total and main_list overlap in memory. This may perform better:

    const auto mainListId = main_list[id];
    for (unsigned int i = 0; i < num; ++i) {
        const auto currId = ids[i];
        ++unique[currId];
        total[currId] += mainListId;
    }

Assuming of course that there is not in fact any aliasing going on.

There's not a whole lot more you can do with such a simple loop. You can make sure your compiler optimization settings are set to maximum and you could try unrolling the loop some if the compiler is not doing it for you. Beyond that you would likely need to make algorithmic improvements beyond the scope of the code you show here.
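
As an illustration of the unrolling idea only - a sketch, not guaranteed to beat what the optimizer already emits, using the same names as the snippet above:

    unsigned int i = 0;
    // Process two ids per iteration.
    for (; i + 1 < num; i += 2) {
        const auto id0 = ids[i];
        const auto id1 = ids[i + 1];
        ++unique[id0];
        total[id0] += mainListId;
        ++unique[id1];
        total[id1] += mainListId;
    }
    // Handle a possible odd trailing element.
    for (; i < num; ++i) {
        const auto currId = ids[i];
        ++unique[currId];
        total[currId] += mainListId;
    }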

You are likely memory bound due to the non-sequential memory accesses resulting from the ordering of ids. This could perhaps be addressed by sorting your ids array before this loop, but without more context for what you're trying to do it's hard to say whether that makes sense.

1 Comment

Thanks for the help, but this doesn't make any noticeable change. It's around 31.5-32%

I'm surprised at the i = ids[j]; //16.8% - that should be faster. It looks like the timing is off. ++unique[i]; //2.0% is a non-linear (non-prefetchable) access and should be slower, not 8 times faster. In fact, ids[] should be in cache so you'd have only 1 in 8 accesses hit main memory. The statement should be 8 times faster. Are you sure you got the right times for the right operations?

That said, you should parallelize the loop. It won't help a lot; main memory won't become faster. But you should keep main memory busy. The idea of the CPU prefetcher is to throw in a few predicted accesses if there are no explicit accesses. If the prediction is right, it saves time, and otherwise it just wastes a bit of energy.

Parallelizing the loop is possible because ids[] is sorted. Even if there are duplicate values, they're adjacent, so you can find split points by finding the first occurrence of a repeated value.
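
A rough sketch of that splitting idea with std::thread (the helper names are hypothetical; it assumes ids is sorted as described, that num_threads is at least 1, and that unique, total and the hoisted main_list[id] value are as in the question):

    #include <cstddef>
    #include <thread>
    #include <vector>

    // One thread's share: a plain copy of the original loop over ids[begin, end).
    static void process_range(const unsigned int* ids, unsigned int begin, unsigned int end,
                              unsigned int* unique, unsigned int* total, unsigned int value)
    {
        for (unsigned int i = begin; i < end; ++i) {
            ++unique[ids[i]];
            total[ids[i]] += value;
        }
    }

    void parallel_accumulate(const unsigned int* ids, unsigned int num,
                             unsigned int* unique, unsigned int* total,
                             unsigned int value, unsigned int num_threads)
    {
        unsigned int chunk = num / num_threads;
        if (chunk == 0) chunk = 1;

        std::vector<std::thread> threads;
        unsigned int begin = 0;
        for (unsigned int t = 0; t < num_threads && begin < num; ++t) {
            unsigned int end = (t + 1 == num_threads) ? num : begin + chunk;
            if (end > num) end = num;
            // Move the split point past any run of equal ids so that two
            // threads never update the same unique/total slot.
            while (end < num && ids[end] == ids[end - 1])
                ++end;
            threads.push_back(std::thread(process_range, ids, begin, end, unique, total, value));
            begin = end;
        }
        for (std::size_t k = 0; k < threads.size(); ++k)
            threads[k].join();
    }

The loop is still memory bound, so expect a modest improvement rather than a linear speed-up.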

