
I am using Visual Studio 2012, building in x64 Release mode. The following code takes 33.5% of the time my program needs to run; I measured this with the Visual Studio profiler.

    //every variable is unsigned int or unsigned int*

    for(unsigned int i = 0; i < num; i++)
    {
        unique[ids[i]]++;//2.1%
        total[ids[i]] += main_list[id];//31.4%
    }

Can someone please recommend a way to decrease the time this function takes to run?

edit: Based on your input I tried the following code:

    const unsigned int now = main_list[id];

    for(unsigned int i = ids[0], j = 0; j < num; j++)
    {
        ++unique[i];//2.0%
        total[i] += now;//16.7%
        i = ids[j];//16.8%
    }

This confirms the theory that probably CPU branch prediction fails because the positions are random (by the way, they are not completely random; they are sorted). Any idea how I could speed up my code, please?

2nd edit: I tried the following:

    const unsigned int now = main_list[id];

    for(unsigned int i = ids[0], j = 0; j < num; j++)
    {
        total[i] += now;//2.0%
        ++unique[i];//16.7%
        i = ids[j];//16.8%
    }

The above test should make it clear what is going on.

  • What are you trying to solve? It could be you are not using the correct algorithm. Just looking at this is not enough information. Commented Sep 2, 2015 at 19:59
  • @NathanOliver: I will make another post for the algorithm, but right now this is slowing down my app. Commented Sep 2, 2015 at 19:59
  • unique[ids[i]] and total[ids[i]] are probably killing a lot of your performance because indirect array references limit the ability of the compiler to do software prefetching. Also, if you could iterate directly through total and unique you could easily use an OpenMP for loop with SSE intrinsics to do your increment and add/save operations 4 at a time per thread. Commented Sep 2, 2015 at 20:15
  • @We'reAllMadHere it sounds like the ids must be sparsely distributed then. You may well be better off using a hash table rather than an array here - even if your ~100 million num elements all have unique ids you still only use ~2.5% of your unique and total array slots. If the ids are spread out randomly across the range, you'll be accessing a handful of bytes per memory page. Commented Sep 2, 2015 at 20:49
  • "... This confirms the theory that probably CPU branch prediction fails ..." Uh, wat? Nobody had that theory, and there are no branches in this code that could fail prediction. Commented Sep 2, 2015 at 20:56

3 Answers


You're not getting any locality friendliness with your code. I'd throw out two possible ideas.

  1. Group unique and total together.

    struct Stuff {
        unsigned int unique, total;
    };
    
    for(unsigned int i = 0; i < num; i++)
    {
        Stuff& s = stuffs[ids[i]];
        s.unique++;
        s.total += main_list[id]; // <== is this supposed to be ids[i]?
    }
    

That will make sure the things you're accessing consecutively are actually next to each other in memory. As-is, assuming num is sufficiently large, you're taking a cache miss on just about every access. That is about as bad as you can get.

  2. Sort ids. Right now, you're still bouncing around in memory. Let's make sure we can actually go sequentially:

    std::sort(ids, ids + num);
    // rest of loop as before
    

This way, it's likely that stuffs[ids[i+1]] will be prefetched while you're processing stuffs[ids[i]]. That'll save you a lot of lookup time as well.
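
Combining both ideas, a minimal sketch of what the loop could end up looking like (treat the names as placeholders: it assumes the Stuff struct and stuffs array from above, and that the main_list[id] value has been hoisted out into a plain value parameter):

    #include <algorithm>

    struct Stuff {
        unsigned int unique, total;
    };

    // Sketch only: sort the ids first, then walk them so consecutive
    // iterations touch neighbouring (or identical) Stuff entries.
    void accumulate(Stuff* stuffs, unsigned int* ids, unsigned int num,
                    unsigned int value /* = main_list[id], hoisted out of the loop */)
    {
        std::sort(ids, ids + num);

        for (unsigned int i = 0; i < num; i++)
        {
            Stuff& s = stuffs[ids[i]];
            s.unique++;
            s.total += value;
        }
    }

If the original order of ids matters elsewhere, sort a copy (or sort index/value pairs) instead of the array itself.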


5 Comments

Basically what you should be trying to do is keep objects that are used 'together' close together in memory as well.
@Tive Yes that is literally what I said.
@Barry: I'm wording your point differently, to maybe make it clearer what you mean. OTOH, I too, forgot to mention "locality of reference", which has better 'google-ability'....
Loop unrolling is another available technique. For example, double the statements inside the loop and increment the loop counter by 2. (Remember to use i+1 in the second set).
I don't think that sorting ids[] will help the prefetcher. A prefetcher sees p,p+4,p+8 and predicts p+12 or p+2,p+4,p+6 to predict p+8. As ids is non-contiguous, you're unlikely to have 4 id values in perfect order.

You may be getting hit by aliasing: the compiler can't fully optimize your loop because it has to allow for the possibility that unique, total and main_list overlap in memory. This may perform better:

    const auto mainListId = main_list[id];
    for (unsigned int i = 0; i < num; ++i) {
        const auto currId = ids[i];
        ++unique[currId];
        total[currId] += mainListId;
    }

Assuming of course that there is not in fact any aliasing going on.

There's not a whole lot more you can do with such a simple loop. You can make sure your compiler optimization settings are set to maximum and you could try unrolling the loop some if the compiler is not doing it for you. Beyond that you would likely need to make algorithmic improvements beyond the scope of the code you show here.
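
As an illustration of the unrolling idea only - a sketch, not guaranteed to beat what the optimizer already emits, using the same names as the snippet above:

    unsigned int i = 0;
    // Process two ids per iteration.
    for (; i + 1 < num; i += 2) {
        const auto id0 = ids[i];
        const auto id1 = ids[i + 1];
        ++unique[id0];
        total[id0] += mainListId;
        ++unique[id1];
        total[id1] += mainListId;
    }
    // Handle a possible odd trailing element.
    for (; i < num; ++i) {
        const auto currId = ids[i];
        ++unique[currId];
        total[currId] += mainListId;
    }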

You are likely memory bound due to the non-sequential memory accesses resulting from the ordering of ids. This could perhaps be addressed by sorting your ids array before this loop, but without more context for what you're trying to do it's hard to say whether that makes sense.

1 Comment

Thanks for the help, but this doesn't make any noticeable change. It's around 31.5-32%

I'm surprised at the i = ids[j]; //16.8% - that should be faster. It looks like the timing is off. ++unique[i]; //2.0% is a non-linear (non-prefetchable) access and should be slower, not 8 times faster. In fact, ids[] should be in cache so you'd have only 1 in 8 accesses hit main memory. The statement should be 8 times faster. Are you sure you got the right times for the right operations?

That said, you should parallelize the loop. It won't help a lot; main memory won't become faster. But you should keep main memory busy. The idea of the CPU prefetcher is to throw in a few predicted accesses if there are no explicit accesses. If the prediction is right, it saves time, and otherwise it just wastes a bit of energy.

Parallelizing the loop is possible because ids[] is sorted. Even if there are duplicate values, they're adjacent, so you can find split points by finding the first occurrence of a repeated value.
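
A rough sketch of that splitting idea with std::thread (the helper names are hypothetical; it assumes ids is sorted as described, that num_threads is at least 1, and that unique, total and the hoisted main_list[id] value are as in the question):

    #include <cstddef>
    #include <thread>
    #include <vector>

    // One thread's share: a plain copy of the original loop over ids[begin, end).
    static void process_range(const unsigned int* ids, unsigned int begin, unsigned int end,
                              unsigned int* unique, unsigned int* total, unsigned int value)
    {
        for (unsigned int i = begin; i < end; ++i) {
            ++unique[ids[i]];
            total[ids[i]] += value;
        }
    }

    void parallel_accumulate(const unsigned int* ids, unsigned int num,
                             unsigned int* unique, unsigned int* total,
                             unsigned int value, unsigned int num_threads)
    {
        unsigned int chunk = num / num_threads;
        if (chunk == 0) chunk = 1;

        std::vector<std::thread> threads;
        unsigned int begin = 0;
        for (unsigned int t = 0; t < num_threads && begin < num; ++t) {
            unsigned int end = (t + 1 == num_threads) ? num : begin + chunk;
            if (end > num) end = num;
            // Move the split point past any run of equal ids so that two
            // threads never update the same unique/total slot.
            while (end < num && ids[end] == ids[end - 1])
                ++end;
            threads.push_back(std::thread(process_range, ids, begin, end, unique, total, value));
            begin = end;
        }
        for (std::size_t k = 0; k < threads.size(); ++k)
            threads[k].join();
    }

The loop is still memory bound, so expect a modest improvement rather than a linear speed-up.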

