For loop performance difference, and compiler optimization

Question

I chose David's answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what happens when setting the optimization flags on.

Jerry Coffin's answer explained what happens when setting the optimization flags for this example. What remains unanswered is why superCalculationA runs slower than superCalculationB, when B performs one extra memory reference and one addition for each iteration. Nemo's post shows the assembler output. I confirmed this compiling with the -S flag on my PC, 2.9GHz Sandy Bridge (i5-2310), running Ubuntu 12.04 64-bit, as suggested by Matteo Italia.

I was experimenting with for-loop performance when I stumbled upon the following case.

I have the following code that does the same computation in two different ways.

#include <cstdint>
#include <chrono>
#include <cstdio>

using std::uint64_t;

uint64_t superCalculationA(int init, int end)
{
    uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}

uint64_t superCalculationB(int init, int todo)
{
    uint64_t total = 0;
    for (int i = init; i < init + todo; i++)
        total += i;
    return total;
}

int main()
{
    const uint64_t answer = 500000110500000000;

    std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
    double elapsed;

    std::printf("=====================================================\n");

    start = std::chrono::high_resolution_clock::now();
    uint64_t ret1 = superCalculationA(111, 1000000111);
    end = std::chrono::high_resolution_clock::now();
    elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    start = std::chrono::high_resolution_clock::now();
    uint64_t ret2 = superCalculationB(111, 1000000000);
    end = std::chrono::high_resolution_clock::now();
    elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    if (ret1 == answer)
    {
        std::printf("The first method, i.e. superCalculationA, succeeded.\n");
    }
    if (ret2 == answer)
    {
        std::printf("The second method, i.e. superCalculationB, succeeded.\n");
    }

    std::printf("=====================================================\n");

    return 0;
}

Compiling this code with

g++ main.cpp -o output --std=c++11

leads to the following result:

=====================================================
Elapsed time: 2.859 s | 2859.441 ms | 2859440.968 us
Elapsed time: 2.204 s | 2204.059 ms | 2204059.262 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================

My first question is: why is the second loop running 23% faster than the first?

On the other hand, if I compile the code with

g++ main.cpp -o output --std=c++11 -O1

The results improve a lot,

=====================================================
Elapsed time: 0.318 s | 317.773 ms | 317773.142 us
Elapsed time: 0.314 s | 314.429 ms | 314429.393 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================

and the difference in time almost disappears.

But I could not believe my eyes when I set the -O2 flag,

g++ main.cpp -o output --std=c++11 -O2

and got this:

=====================================================
Elapsed time: 0.000 s | 0.000 ms | 0.328 us
Elapsed time: 0.000 s | 0.000 ms | 0.208 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================

So, my second question is: What is the compiler doing when I set -O1 and -O2 flags that leads to this gigantic performance improvement?

I checked Optimize Options - Using the GNU Compiler Collection (GCC), but that did not clarify things.

By the way, I am compiling this code with g++ (GCC) 4.9.1.

EDIT to confirm Basile Starynkevitch's assumption

I edited the code; main now looks like this (it additionally needs #include <cstdlib> for atoi):

int main(int argc, char **argv)
{
    int start = atoi(argv[1]);
    int end   = atoi(argv[2]);
    int delta = end - start;   // start + delta == end, so both functions cover [start, end)

    std::chrono::time_point<std::chrono::high_resolution_clock> t_start, t_end;
    double elapsed;

    std::printf("=====================================================\n");

    t_start = std::chrono::high_resolution_clock::now();
    uint64_t ret1 = superCalculationB(start, delta);
    t_end = std::chrono::high_resolution_clock::now();
    elapsed = (t_end - t_start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    t_start = std::chrono::high_resolution_clock::now();
    uint64_t ret2 = superCalculationA(start, end);
    t_end = std::chrono::high_resolution_clock::now();
    elapsed = (t_end - t_start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    std::printf("Results were %s\n", (ret1 == ret2) ? "the same!" : "different!");
    std::printf("=====================================================\n");

    return 0;
}

These modifications really increased the computation time for both -O1 and -O2; both now give me around 620 ms, which shows that -O2 was indeed doing some of the computation at compile time.

I still do not understand what these flags are doing to improve performance, and -Ofast does even better, at about 320ms.
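For anyone who wants to keep the literal bounds but still stop the optimizer from folding the whole computation into a constant, one option is to route the bounds through volatile variables. This is only a sketch: the helper names runtimeValue and sumWithHiddenBounds are made up for this example, and an optimizer may still replace the loop with a closed-form expression evaluated at run time.

```cpp
#include <cstdint>

// Same loop as superCalculationA in the question.
std::uint64_t superCalculationA(int init, int end)
{
    std::uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}

// Hypothetical helper: the compiler must actually perform a read from a
// volatile object and may not assume its value, so the result is not a
// compile-time constant from the optimizer's point of view.
int runtimeValue(int v)
{
    volatile int hidden = v;
    return hidden;
}

// Sums [111, 111 + count) using bounds the optimizer cannot treat as literals.
std::uint64_t sumWithHiddenBounds(int count)
{
    const int init = runtimeValue(111);
    const int end  = runtimeValue(111 + count);
    return superCalculationA(init, end);
}
```

Timing sumWithHiddenBounds(1000000000) in place of the direct call should then measure a real computation rather than two back-to-back clock reads.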

Also notice that I have changed the order in which functions A and B are called, to test Jerry Coffin's assumption. Compiling this code with no optimizer flags still gives me around 2.2 s for B and 2.8 s for A, so I figure that it is not a cache effect. To be clear, I am not asking about optimization in the first case (the one with no flags); I just want to know what makes the second loop run faster than the first.

Without optimizations turned on (your first case), it doesn't make sense to compare timings because the generated code is very nearly a direct translation of your code to assembly. With optimizations on, the compiler can almost certainly eliminate your loop entirely in this case. — Cameron
– Cameron, Commented Aug 29, 2014 at 21:26
I guess that with -O2 GCC is doing most computations at compile time. The arguments to SuperCalculationA & SuperCalculationB should be variable, e.g. given thru the program arguments (e.g. int init = atoi(argv[1]); int end = atoi(argv[2]); in your main) — Basile Starynkevitch
– Basile Starynkevitch, Commented Aug 29, 2014 at 21:45
I must correct myself: looking at the assembly at -O0 doesn't illuminate the issue a tiny bit. The emitted assembly is obviously a naive C->assembly translation, but, despite doing more things in very similar code (and accessing one more location on the stack) it turns out that superCalculationB is faster (confirmed by the profiler). The result holds even repeating both calculations several times in a for loop. — Matteo Italia
– Matteo Italia, Commented Aug 29, 2014 at 22:13
@jcmonteiro Thanks for selecting my answer. I did some more homework and now think I have a more solid explanation without any mysteries. Please check out my revised answer. — dshin
– dshin, Commented Sep 9, 2014 at 7:05

Jerry Coffin · Answer · 2014-08-30 16:01:46Z

11

My immediate guess would be that the second is faster, not because of the changes you made to the loop, but because it's second, so the cache is already primed when it runs.

To test the theory, I re-arranged your code to reverse the order in which the two calculations were called:

#include <cstdint>
#include <chrono>
#include <cstdio>

using std::uint64_t;

uint64_t superCalculationA(int init, int end)
{
    uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}

uint64_t superCalculationB(int init, int todo)
{
    uint64_t total = 0;
    for (int i = init; i < init + todo; i++)
        total += i;
    return total;
}

int main()
{
    const uint64_t answer = 500000110500000000;

    std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
    double elapsed;

    std::printf("=====================================================\n");

    start = std::chrono::high_resolution_clock::now();
    uint64_t ret2 = superCalculationB(111, 1000000000);
    end = std::chrono::high_resolution_clock::now();
    elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    start = std::chrono::high_resolution_clock::now();
    uint64_t ret1 = superCalculationA(111, 1000000111);
    end = std::chrono::high_resolution_clock::now();
    elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    if (ret1 == answer)
    {
        std::printf("The first method, i.e. superCalculationA, succeeded.\n");
    }
    if (ret2 == answer)
    {
        std::printf("The second method, i.e. superCalculationB, succeeded.\n");
    }

    std::printf("=====================================================\n");

    return 0;
}

The result I got was:

=====================================================
Elapsed time: 0.286 s | 286.000 ms | 286000.000 us
Elapsed time: 0.271 s | 271.000 ms | 271000.000 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================

So, when version A runs first, it's slower. When version B runs first, it's slower.

To confirm, I added an extra call to superCalculationB before doing the timing on either version A or B. After that, I tried running the program three times. For those three runs, I'd judge the results a tie (version A was faster once and version B was faster twice, but neither won dependably nor by a wide enough margin to be meaningful).

That doesn't prove that it's actually a cache situation as such, but does give a pretty strong indication that it's a matter of the order in which the functions are called, not the difference in the code itself.
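The warm-up experiment described above can be sketched as a small timing helper. The name bestOfSeconds is invented for this example, and it uses std::chrono::steady_clock rather than the high_resolution_clock from the question; it is one possible arrangement, not the exact code used for the measurements reported here. Note that with literal arguments an optimizing build may still fold the work away.

```cpp
#include <chrono>
#include <cstdint>

// Same loop as superCalculationB in the question.
std::uint64_t superCalculationB(int init, int todo)
{
    std::uint64_t total = 0;
    for (int i = init; i < init + todo; i++)
        total += i;
    return total;
}

// Hypothetical helper: performs one untimed warm-up call, then returns the
// fastest of `runs` timed calls in seconds (runs is assumed to be >= 1).
template <typename F>
double bestOfSeconds(F f, int runs)
{
    using Clock = std::chrono::steady_clock;
    // Storing into a volatile keeps the result from being discarded.
    volatile std::uint64_t sink = f();   // warm-up call, not timed
    double best = -1.0;
    for (int r = 0; r < runs; ++r) {
        const auto t0 = Clock::now();
        sink = f();
        const auto t1 = Clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || s < best)
            best = s;
    }
    (void)sink;
    return best;
}
```

A call such as bestOfSeconds([] { return superCalculationB(111, 1000000000); }, 3) then reports a time that no longer depends on which function happened to run first.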

As far as what the compiler does to make the code faster: the main thing it does is unroll a few iterations of the loop. We can get pretty much the same effect if we unroll a few iterations by hand:

uint64_t superCalculationC(int init, int end)
{
    int f_end = end - ((end - init) & 7);

    int i;
    uint64_t total = 0;
    for (i = init; i < f_end; i += 8) {
        total += i;
        total += i + 1;
        total += i + 2;
        total += i + 3;
        total += i + 4;
        total += i + 5;
        total += i + 6;
        total += i + 7;
    }

    for (; i < end; i++)
        total += i;

    return total;
}

This has a property that some might find rather odd: it's actually faster when compiled with -O2 than with -O3. When compiled with -O2, it's also about five times faster than either of the other two are when compiled with -O3.

The primary reason for the ~5x speed gain compared to the compiler's loop unrolling is that we've unrolled the loop somewhat differently (and more intelligently, IMO) than the compiler does. We compute f_end to tell us how many times the unrolled loop should execute. We execute those iterations, then we execute a separate loop to "clean up" any odd iterations at the end.

The compiler instead generates code that's roughly equivalent to something like this:

for (i = init; i < end; i += 8) {
    total += i;
    if (i + 1 >= end) break;
    total += i + 1;
    if (i + 2 >= end) break;
    total += i + 2;
    // ...
}

Although this is quite a bit faster than when the loop hasn't been unrolled at all, it's quite a bit faster still to eliminate those extra checks from the main loop, and execute a separate loop for any odd iterations.
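Since the hand-unrolled version splits the range into a main loop and a clean-up loop, it is worth checking that it still agrees with the plain loop for every remainder. The sketch below repeats the two functions so that it compiles on its own; the checker unrolledMatchesReference is a made-up name for this example.

```cpp
#include <cstdint>

// Reference: the plain loop from the question.
std::uint64_t superCalculationA(int init, int end)
{
    std::uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}

// Hand-unrolled by 8, with a separate clean-up loop for the leftover iterations.
std::uint64_t superCalculationC(int init, int end)
{
    int f_end = end - ((end - init) & 7);   // last bound reachable in steps of 8

    int i;
    std::uint64_t total = 0;
    for (i = init; i < f_end; i += 8) {
        total += i;
        total += i + 1;
        total += i + 2;
        total += i + 3;
        total += i + 4;
        total += i + 5;
        total += i + 6;
        total += i + 7;
    }

    for (; i < end; i++)                    // at most 7 remaining iterations
        total += i;

    return total;
}

// Hypothetical checker: compares both versions for every length 0..maxLen,
// which covers all eight possible remainders once maxLen >= 7.
bool unrolledMatchesReference(int init, int maxLen)
{
    for (int len = 0; len <= maxLen; ++len)
        if (superCalculationC(init, init + len) != superCalculationA(init, init + len))
            return false;
    return true;
}
```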

Given such a trivial loop body being executed such a large number of times, you can also improve speed (when compiled with -O2) still further by unrolling more iterations of the loop. With 16 iterations unrolled, it was about twice as fast as the code above with 8 iterations unrolled:

uint64_t superCalculationC(int init, int end)
{
    int first_end = end - ((end - init) & 0xf);

    int i;
    uint64_t total = 0;
    for (i = init; i < first_end; i += 16) {
        total += i + 0;
        total += i + 1;
        total += i + 2;

        // code for `i+3` through `i+13` goes here

        total += i + 14;
        total += i + 15;
    }

    for (; i < end; i++)
        total += i;

    return total;
}

I haven't tried to explore the limit of gains from unrolling this particular loop, but unrolling 32 iterations nearly doubles the speed again. Depending on the processor you're using, you might get some small gains by unrolling 64 iterations, but I'd guess we're starting to approach the limits--at some point, performance gains will probably level off, then (if you unroll still more iterations) probably drop off, quite possibly dramatically.

Summary: with -O3 the compiler unrolls a number of iterations of the loop. This is extremely effective in this case, primarily because we have many executions of nearly the most trivial possible loop body. Unrolling the loop by hand is even more effective than letting the compiler do it--we can unroll more intelligently, and we can simply unroll more iterations than the compiler does. The extra intelligence can give us an improvement of around 5:1, and the extra iterations another 4:1 or so¹ (at the expense of somewhat longer, slightly less readable code).

Final caveat: as always with optimization, your mileage may vary. Differences in compilers and/or processors mean you're likely to get at least somewhat different results than I did. I'd expect my hand-unrolled loop to be substantially faster than the other two in most cases, but exactly how much faster is likely to vary.

¹ But note that this compares the hand-unrolled loop compiled with -O2 to the original loop compiled with -O3. When compiled with -O3, the hand-unrolled loop runs much more slowly.

edited Aug 30, 2014 at 16:01

answered Aug 30, 2014 at 1:19

Jerry Coffin


20 Comments

jcmonteiro Over a year ago

Thanks for your reply, Jerry. Are you compiling this code with some optimizer flag? I double-checked my code, and compiling with no -O flag gives me those results, 2.8 for A and 2.2 for B, even when I change the order of the calls.

Jerry Coffin Over a year ago

Yes, I tried it with -O2 and -O3 (and also with MS VC++, using -O2b2 -GL). Different total time, but the same basic results with both compilers: whichever routine ran first also ran slower.

jcmonteiro Over a year ago

When compiling with these flags they really give the same results. My point is that when compiling without them routine B runs faster than A. Try removing all optimizer flags.

Jerry Coffin Over a year ago

Looking at optimization with optimization disabled is about the most pointless exercise humanly possible. Just for grins I did compile to assembly with both -O2 and -O3. In both cases, the compiler created identical code for the two versions of the code. Any difference in execution speed is purely an artifact of how you're doing the timing.

jcmonteiro Over a year ago

I think you did not understand the question. I asked what caused the difference in time with optimization disabled and what optimization did to reduce the computation time.

Rusan Kax · Answer · 2014-08-29 21:53:40Z

6

Checking the assembly output is really the only way to illuminate such things.

Compiler optimisations will do a great many things, including things that are not strictly "standard compliant" (although, to my knowledge, that is not the case with -O1 and -O2); see, for instance, the -Ofast switch.

I have found this helpful: http://gcc.godbolt.org/, and with your demo code here

edited Aug 29, 2014 at 21:53

answered Aug 29, 2014 at 21:44

Rusan Kax


1 Comment

jcmonteiro Over a year ago

Thank you for your answer, Rusan. Nice website, I will certainly use it in the future. As far as assembly goes I am still learning, so if you could point out what is happening when I turn on the flags -O1 and -Ofast, I would appreciate it. Please refer to the edited main function where I added argc and argv.

Surt · Answer · 2014-09-07 21:40:30Z

-O2

The -O2 result is easy to explain. Here is the code godbolt shows for main after switching to -O2:

main:
pushq   %rbx
movl    $.LC2, %edi
call    puts
call    std::chrono::_V2::system_clock::now()
movq    %rax, %rbx
call    std::chrono::_V2::system_clock::now()
pxor    %xmm0, %xmm0
subq    %rbx, %rax
movsd   .LC4(%rip), %xmm2
movl    $.LC6, %edi
movsd   .LC5(%rip), %xmm1
cvtsi2sdq   %rax, %xmm0
movl    $3, %eax
mulsd   .LC3(%rip), %xmm0
mulsd   %xmm0, %xmm2
mulsd   %xmm0, %xmm1
call    printf
call    std::chrono::_V2::system_clock::now()
movq    %rax, %rbx
call    std::chrono::_V2::system_clock::now()
pxor    %xmm0, %xmm0
subq    %rbx, %rax
movsd   .LC4(%rip), %xmm2
movl    $.LC6, %edi
movsd   .LC5(%rip), %xmm1
cvtsi2sdq   %rax, %xmm0
movl    $3, %eax
mulsd   .LC3(%rip), %xmm0
mulsd   %xmm0, %xmm2
mulsd   %xmm0, %xmm1
call    printf
movl    $.LC7, %edi
call    puts
movl    $.LC8, %edi
call    puts
movl    $.LC2, %edi
call    puts
xorl    %eax, %eax
popq    %rbx
ret

Neither of the two functions is called, and the results are never compared.

How can that be? It is, of course, the power of optimization: the program is too simple.

First, inlining is applied, after which the compiler can see that all the parameters are in fact literal values (111, 1000000111, 1000000000, 500000110500000000) and therefore constants.

It then finds that init + todo is a loop invariant and replaces it with end, defined before the loop in B as end = init + todo = 111 + 1000000000 = 1000000111.
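Done by hand, the hoisting step looks roughly like the sketch below. superCalculationBHoisted is a made-up name for this example, and the code only illustrates the transformation; it is not a claim about the exact intermediate form GCC produces.

```cpp
#include <cstdint>

// Original form: init + todo appears in the loop condition.
std::uint64_t superCalculationB(int init, int todo)
{
    std::uint64_t total = 0;
    for (int i = init; i < init + todo; i++)
        total += i;
    return total;
}

// Hand-hoisted form (hypothetical): neither init nor todo changes inside the
// loop, so the bound can be computed once, before the loop starts.
std::uint64_t superCalculationBHoisted(int init, int todo)
{
    const int end = init + todo;   // loop invariant, evaluated a single time
    std::uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}
```

After this step the body is identical to superCalculationA called with end = init + todo.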

Both loops are now known to contain only compile-time values. Furthermore, they are exactly the same:

uint64_t total = 0;
for (int i = 111; i < 1000000111; i++)
    total += i;
return total;

The compiler sees that this is a summation with total as the accumulator and a constant stride of 1, so it performs the ultimate loop unrolling, namely all of it: it knows that a sum of this form has a closed-form value.

Rewriting Gauss's formula s = n*(n+1)/2: write the series forwards and backwards and add the two term by term:

111 + 1000000110 = 1000000221
112 + 1000000109 = 1000000221
...
1000000109 + 112 = 1000000221
1000000110 + 111 = 1000000221

Number of terms (loops) = 1000000111 - 111 = 1E9

Halve the product, since the pairing counts every term twice:

1000000221 * 1E9 / 2 = 500000110500000000

which is the expected result, 500000110500000000.

Now that it has the result, which is a compile-time constant, it can compare it with the expected value, note that the comparison is always true, and remove it.
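The arithmetic can be checked with a small constexpr function. This is a sketch only: rangeSum is a made-up helper, it assumes the intermediate product fits in 64 bits, and it does not claim that GCC derives the constant in exactly this way.

```cpp
#include <cstdint>

// Hypothetical helper: sum of the integers in [init, end), computed as
// (number of terms) * (first + last) / 2.
// Assumes (end - init) * (init + end - 1) does not overflow 64 bits.
constexpr std::uint64_t rangeSum(std::int64_t init, std::int64_t end)
{
    return end <= init
        ? 0
        : static_cast<std::uint64_t>((end - init) * (init + end - 1) / 2);
}

// Checked by the compiler itself, with no loop executed at run time.
static_assert(rangeSum(111, 1000000111) == 500000110500000000ULL,
              "closed form matches the constant from the question");
```

Because the static_assert is evaluated during compilation, the file builds only if the closed form really yields 500000110500000000.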

The time noted is the minimum time for system_clock on your PC.
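To see how large that minimum is on a given machine, two clock reads can be taken back to back with nothing in between. backToBackNanoseconds is a made-up helper, and it uses std::chrono::steady_clock instead of the clock from the question so that the difference can never be negative.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: nanoseconds that elapse between two consecutive
// clock reads, i.e. the cost of "timing nothing".
std::int64_t backToBackNanoseconds()
{
    using Clock = std::chrono::steady_clock;
    const auto t0 = Clock::now();
    const auto t1 = Clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

If the value returned is of the same order as the -O2 timings in the question, those timings say nothing about the loops themselves.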

-O0

The -O0 timing is harder to explain and is most likely an artifact of the missing alignment of functions and jump targets; both the µop cache and the loop buffer like 32-byte alignment. You can test that by adding a few

asm("nop");

in front of A's loop; 2-3 might do the trick. Store forwarding also works best when the values are naturally aligned.

dshin · Accepted Answer · 2014-09-09 07:01:52Z

EDIT: After learning more about dependencies in processor pipelining, I revised my answer, removing some unnecessary details and offering a more concrete explanation of the slowdown.

It appears that the performance difference in the -O0 case is due to processor pipelining.

First, the assembly (for the -O0 build), copied from Nemo's answer, with some of my own comments inline:

superCalculationA(int, int):
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -20(%rbp)    # init
    movl    %esi, -24(%rbp)    # end
    movq    $0, -8(%rbp)       # total = 0
    movl    -20(%rbp), %eax    # copy init to register rax
    movl    %eax, -12(%rbp)    # i = [rax]
    jmp .L7
.L8:
    movl    -12(%rbp), %eax    # copy i to register rax
    cltq
    addq    %rax, -8(%rbp)     # total += [rax]
    addl    $1, -12(%rbp)      # i++
.L7:
    movl    -12(%rbp), %eax    # copy i to register rax
    cmpl    -24(%rbp), %eax    # [rax] < end
    jl  .L8
    movq    -8(%rbp), %rax
    popq    %rbp
    ret

superCalculationB(int, int):
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -20(%rbp)    # init
    movl    %esi, -24(%rbp)    # todo
    movq    $0, -8(%rbp)       # total = 0
    movl    -20(%rbp), %eax    # copy init to register rax
    movl    %eax, -12(%rbp)    # i = [rax]
    jmp .L11
.L12:
    movl    -12(%rbp), %eax    # copy i to register rax
    cltq
    addq    %rax, -8(%rbp)     # total += [rax]
    addl    $1, -12(%rbp)      # i++
.L11:
    movl    -20(%rbp), %edx    # copy init to register rdx
    movl    -24(%rbp), %eax    # copy todo to register rax
    addl    %edx, %eax         # [rax] += [rdx]  (so [rax] = init+todo)
    cmpl    -12(%rbp), %eax    # i < [rax]
    jg  .L12
    movq    -8(%rbp), %rax
    popq    %rbp
    ret

In both functions, the stack layout looks like this:

Offset  Content
(relative to %rbp)

-24     end/todo
-20     init
-16     <empty>
-12     i
-08     total
-04     total (upper half)
 00     <base pointer>

(Note that total is a 64-bit int and so occupies two 4-byte slots.)

These are the key lines of superCalculationA():

    addl    $1, -12(%rbp)      # i++
.L7:
    movl    -12(%rbp), %eax    # copy i to register rax
    cmpl    -24(%rbp), %eax    # [rax] < end

The stack address -12(%rbp) (which holds the value of i) is written to in the addl instruction, and then it is immediately read in the very next instruction. The read instruction cannot begin until the write has completed. This represents a block in the pipeline, causing superCalculationA() to be slower than superCalculationB().

You might be curious why superCalculationB() doesn't have this same pipeline block. It's really just an artifact of how gcc compiles the code in -O0 and doesn't represent anything fundamentally interesting. Basically, in superCalculationA(), the comparison i<end is performed by reading i from a register, while in superCalculationB(), the comparison i<init+todo is performed by reading i from the stack.

To demonstrate that this is just an artifact, let's replace

for (int i = init; i < end; i++)

with

for (int i = init; end > i; i++)

in superCalculationA(). The generated assembly then looks the same, with just the following change to the key lines:

    addl    $1, -12(%rbp)      # i++
.L7:
    movl    -24(%rbp), %eax    # copy end to register rax
    cmpl    -12(%rbp), %eax    # i < [rax]

Now i is read from the stack, and the pipeline block is gone. Here are the performance numbers after making this change:

=====================================================
Elapsed time: 2.296 s | 2295.812 ms | 2295812.000 us
Elapsed time: 2.368 s | 2367.634 ms | 2367634.000 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================

It should be noted that this is really a toy example, since we are compiling with -O0. In the real world, we compile with -O2 or -O3. In that case, the compiler orders the instructions in such a way so as to minimize pipeline blocks, and we don't need to worry about whether to write i<end or end>i.
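For anyone who wants to repeat the experiment, the two spellings of the comparison can be put side by side in one file and built with -O0. The function names are made up for this sketch, and whether a given compiler emits a different instruction order for the two is not guaranteed; the difference described above was observed with GCC 4.9.1.

```cpp
#include <cstdint>

// Comparison written as in the question: i < end.
std::uint64_t sumLessThan(int init, int end)
{
    std::uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}

// Same loop with the operands swapped: end > i.
// Semantically identical; only the unoptimized code generation may differ.
std::uint64_t sumGreaterThan(int init, int end)
{
    std::uint64_t total = 0;
    for (int i = init; end > i; i++)
        total += i;
    return total;
}
```

Compiling this file with -S -O0 and comparing the two loop conditions in the output shows directly whether the compiler at hand reproduces the effect.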

Nemo · Answer · 2014-08-30 21:11:50Z

2

(This is not exactly an answer, but it does include more data, including some that conflicts with Jerry Coffin's.)

The interesting question is why the unoptimized routines perform so differently and counter-intuitively. The -O2 and -O3 cases are relatively simple to explain, and others have done so.

For completeness, here is the assembly (thanks @Rusan Kax) for superCalculationA and superCalculationB produced by GCC 4.9.1:

superCalculationA(int, int):
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -20(%rbp)
    movl    %esi, -24(%rbp)
    movq    $0, -8(%rbp)
    movl    -20(%rbp), %eax
    movl    %eax, -12(%rbp)
    jmp .L7
.L8:
    movl    -12(%rbp), %eax
    cltq
    addq    %rax, -8(%rbp)
    addl    $1, -12(%rbp)
.L7:
    movl    -12(%rbp), %eax
    cmpl    -24(%rbp), %eax
    jl  .L8
    movq    -8(%rbp), %rax
    popq    %rbp
    ret

superCalculationB(int, int):
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -20(%rbp)
    movl    %esi, -24(%rbp)
    movq    $0, -8(%rbp)
    movl    -20(%rbp), %eax
    movl    %eax, -12(%rbp)
    jmp .L11
.L12:
    movl    -12(%rbp), %eax
    cltq
    addq    %rax, -8(%rbp)
    addl    $1, -12(%rbp)
.L11:
    movl    -20(%rbp), %edx
    movl    -24(%rbp), %eax
    addl    %edx, %eax
    cmpl    -12(%rbp), %eax
    jg  .L12
    movq    -8(%rbp), %rax
    popq    %rbp
    ret

It sure looks to me like B is doing more work.

My test platform is a 2.9GHz Sandy Bridge EP processor (E5-2690) running Red Hat Enterprise 6 Update 3. My compiler is GCC 4.9.1 and produces the assembly above.

To make sure Turbo Boost and related CPU-frequency-diddling technologies are not interfering with the measurement, I ran:

pkill cpuspeed # if you have it running
grep MHz /proc/cpuinfo # to see where you start
modprobe acpi_cpufreq # if you do not have it loaded
cd /sys/devices/system/cpu 
for cpuN in cpu[0-9]* ; do
    echo userspace > $cpuN/cpufreq/scaling_governor
    echo 2000000 > $cpuN/cpufreq/scaling_setspeed
done
grep MHz /proc/cpuinfo # to see if it worked

This pins the CPU frequency to 2.0 GHz and disables Turbo Boost.

Jerry observed these two routines running faster or slower depending on the order in which he executed them. I could not reproduce that result. For me, superCalculationB consistently runs 25-30% faster than superCalculationA, regardless of the Turbo Boost or clock speed settings. That includes running them multiple times in arbitrary order. For example, at 2.0GHz superCalculationA consistently takes a little over 4500ms and superCalculationB consistently takes a little under 3600ms.

I have yet to see any theory that even begins to explain this.

answered Aug 30, 2014 at 21:11

Nemo


3 Comments

jcmonteiro Over a year ago

Thanks for your reply, Nemo. I am also using GCC 4.9.1 and it produces the same assembly as yours. I am going through this assembly output to see if I can find out what is going on. To me, too, it seems that superCalculationB is doing more work.

Jerry Coffin Over a year ago

B is doing one load and one addition that A isn't. That gives 6 memory references instead of 5, and most of the reason this is so much slower than with -O2 (for example) is due to the constant reference to memory instead of holding values in registers. Hard to see how 20% more memory references (or 17%, depending on viewpoint) could lead to a 25% speed difference though.

Nemo Over a year ago

@JerryCoffin: Also B is 25% faster than A. So 20% more memory references is resulting in a 25% increase in speed. Store forwarding does mean a memory reference isn't always a memory reference... But still, I have no idea what is going on here.

gnasher729 · Answer · 2014-09-03 09:22:16Z

Processors are complicated. Execution time depends on many things, many of which are outside your control. Just a few possibilities:

a. Your computer probably doesn't have a constant clock speed. It could be that the clock speed is usually set rather low to avoid wasting energy / battery life / producing excessive heat. When your program starts running, the OS figures out that power is needed and increases the clock speed. To verify, change the order of the calls - if the second loop executed is always faster than the first one, that may be the reason.

b. The exact execution speed, especially for a tight loop like yours, depends on how instructions are aligned in memory. Some processors may run a loop faster if it is completely contained within one cache line instead of two, or in two cache lines instead of three. Some compilers will add nop instructions to align loops on cache lines to optimise for this, most don't. Quite possible that one of the loops was aligned better by pure luck and therefore runs faster.

c. The exact execution speed may depend on the exact order in which instructions are dispatched. Slightly different code may run at different speeds due to subtle differences in the code which may be processor dependent, and anyway may be hard for the compiler to consider.

d. There is some evidence that Intel processors may have problems with artificially short loops which may happen only with artificial benchmarks. Your code is quite close to "artificial". There have been cases discussed in other threads where very short loops ran unexpectedly slow, and adding instructions made them run faster.

I increased the number of instructions inside both loops and A gradually became faster than B. I think the possibility you presented in "d." was the answer. Could you present further information regarding this subject so I can accept your answer?

marc_s · Answer · 2023-10-28 21:01:01Z

Answer to the first question:

A for loop seems to run faster once a loop has already been run, but I am not sure; I am only commenting on the results of my experiments. (Experiment 1: swap the names of the functions (B->A, A->B). Experiment 2: run one function containing a for loop before the timed section. Experiment 3: run a bare for loop before the timed section.)
The first function should be the faster one, because the second function performs two operations per iteration where the first performs one.

Below is the updated code that illustrates my answer.

Answer to the second question:

I am not sure, but two possibilities come to mind:

The compiler may be able to reduce your function to a formula and get rid of the loop altogether, which would erase the difference (something like "return end-init" or "return todo"; I don't know, I'm not sure).

GCC also has -fauto-inc-dec, which could make a difference, because these functions are all about increments and decrements.

I hope this helps.

#include <cstdint>
#include <ctime>
#include <cstdio>

using std::uint64_t;

uint64_t superCalculationA(int init, int end)
{
    uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}
uint64_t superCalculationB(int init, int todo)
{
    uint64_t total = 0;
    for (int i = init; i < init+todo; i++)
        total += i;
    return total;
}
int add(int a1,int a2){printf("multiple times added\n");return a1+a2;}
uint64_t superCalculationC(int init, int todo)
{
    uint64_t total = 0;
    for (int i = init; i < add(init , todo); i++)
        total += i;
    return total;
}

int main()
{
    const uint64_t answer = 500000110500000000;

    std::clock_t start=clock();
    double elapsed;

    std::printf("=====================================================\n");

    superCalculationA(111, 1000000111);

    start = clock();
    uint64_t ret1 = superCalculationA(111, 1000000111);
    elapsed = ((std::clock()-start)*1.0/CLOCKS_PER_SEC);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed,    1e+6*elapsed);

    start = clock();
    uint64_t ret2 = superCalculationB(111, 1000000000);
    elapsed = ((std::clock()-start)*1.0/CLOCKS_PER_SEC);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    if (ret1 == answer)
    {
        std::printf("The first method, i.e. superCalculationA, succeeded.\n");
    }
    if (ret2 == answer)
    {
        std::printf("The second method, i.e. superCalculationB, succeeded.\n");
    }

    std::printf("=====================================================\n");

    return 0;
}
