
To understand compilers, and assembly language in particular, I have been experimenting with a trivial piece of code that calculates the sum of the first N numbers. Depending on whether the upper bound is included, the result should be N(N+1)/2 or N(N-1)/2; for example, with N = 4 the inclusive sum is 0+1+2+3+4 = 10 = 4*5/2.

As the code below shows, there are two functions:

#include <cstdint>


// Once compiled with optimization, the generated assembly has a loop

uint64_t sum1( uint64_t n ) {  
    uint64_t sum = 0;
    for ( uint64_t j=0; j<=n; ++j ) {
        sum += j;
    }
    return sum;
}

// Once compiled with optimization, the generated assembly of the following has no loop

uint64_t sum2( uint64_t n ) {  
    uint64_t sum = 0;
    for ( uint64_t j=0; j<n; ++j ) {
        sum += j;
    }
    return sum;
}

In the first function I loop from 0 to N, i.e. j <= n, and in the second function I go from 0 to N-1, i.e. j < n.

My understanding/observation:

  • For the first function sum1 the generated assembly has a loop, while for the second function sum2 the assembly shows no loop. However, once I remove the compiler optimizations (i.e. -O3), the loop for the second function finally appears in the assembly.

  • To see the generated assembly with compiler optimization, see the Optimized link.

  • To see the generated assembly without compiler optimization, see the non-optimized link.

  • The compiler is x86-64 clang.

Question: Why does the compiler optimization remove the loop for sum2 but keep it for sum1?

  • The compiler seems to be observing that sum1 could loop forever, for a certain n. Commented Jan 9, 2022 at 0:39
  • The compiler's (ideal) goal is to create the most efficient code that has the same observable behavior as the source code (when optimizing). Add an output of the temp in the loop to force the compiler to keep the loop. Commented Jan 9, 2022 at 0:41
  • Consider sum1(std::numeric_limits<uint64_t>::max()). Commented Jan 9, 2022 at 0:48
  • @SamVarshavchik: As user17732522 notes, the compiler may assume the loop terminates; this is an axiom granted by the C or C++ standard. This is logically equivalent to assuming n != UINT64_MAX, as those are all the cases and the only cases for which the loop terminates. And so it is equivalent to saying that if the program executes with n equal to UINT64_MAX, the behavior is not defined by the C or C++ standard. It is not defined to be an infinite loop. Commented Jan 9, 2022 at 1:01
  • @IgorTandetnik: That is only for C++. In C, it does not apply to all loops; an iteration statement with a controlling expression that is a constant expression (including an absent expression, which is implicitly non-zero) may continue infinitely with no side effects. And hence you could write OP's loop and get defined infinite-loop behavior instead of undefined behavior by moving the j <= n from the controlling expression to a conditional break. Commented Jan 9, 2022 at 1:04

1 Answer


This is because your compiler is very, very smart, and it knows that the sum of all values from 0 to n can be calculated with a trivial mathematical formula, instead of a loop.

However, your C++ compiler also figured out that this mathematical formula cannot be used in the <= version: for n equal to UINT64_MAX the condition j <= n is always true, so the loop never terminates. With that infinite loop on the table, all bets are off, and the compiler compiles the code exactly as given.
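For reference, a minimal sketch of the closed form involved (my illustration; clang's actual instruction sequence differs in detail, and sum2_closed_form is a hypothetical name). The wrapping sum of 0 through n-1 modulo 2^64 equals n(n-1)/2; computing the product in 128 bits (unsigned __int128 is a clang/GCC extension) keeps it exact before halving, and truncating to 64 bits reproduces the loop's wrapping arithmetic.

#include <cstdint>

// Sketch of the formula the optimizer effectively substitutes for sum2.
// n*(n-1) is always even, so shifting right by one is exact.
uint64_t sum2_closed_form( uint64_t n ) {
    if ( n == 0 ) return 0;
    unsigned __int128 p = (unsigned __int128)n * ( n - 1 );
    return (uint64_t)( p >> 1 );
}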


20 Comments

It is absolutely true that when debugging code you do not want these kinds of optimizations taking place, which is precisely why optimizations require a configuration setting to enable (or disable).
If you make the infinite loop more obvious, clang will recognize it and remove it, even if the value calculated in the sum is used. It even produces an empty function without instructions, which would result in nonsensical programs: godbolt.org/z/cv6ndjYhT, godbolt.org/z/Pff5rf1eh
@Jellyboy: It's likely that the pattern-recognizer for sum-of-linear-sequence loops is looking for specific things in the loop, which presumably includes the loop being provably finite. The fact that this part of clang/LLVM fails to use value-range information from __builtin_assume could just as easily be a missed optimization. Although interestingly, if you do n &= 0xfff instead of assume, the optimizer is able to use Gauss's formula: godbolt.org/z/Mn16PYvqo (With a little less effort to avoid overflow, but still more than needed for the tiny n.)
@Jellyboy: It's the obvious and by far the most likely explanation. Some experiments on both versions are consistent with that theory, e.g. j += 3 makes the j<n version not optimize. And interestingly, using __builtin_assume( n<1000 ) defeats optimization that we otherwise get with n&=0xffff. godbolt.org/z/5j3nhdnvT. Fully empty loop removal always happens for clang regardless of __builtin_assume. (GCC doesn't do that, although an ASSUME using if(!x)__builtin_unreachable() does let GCC remove potentially-infinite empty loops, too.) godbolt.org/z/4TKfxq4rn
Yes, this is SO, we should expect good speculation, ideally backed up by sufficient evidence to go with that theory. Also useful practical advice, which this is. It's unreasonable to expect an answerer to dig into the LLVM source code and find the exact pattern-recognizer algorithm to rule out any other possibility. If someone knows of a more accurate explanation (or worse, if an explanation is totally wrong, which is not rare on performance questions based on how CPUs/caches work), they can comment. But lack of definitive proof isn't IMO a problem, at least here.
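A sketch of the masking experiment described in the comments above (assumption: the observed behavior is specific to the clang versions in the godbolt links; sum1_masked is a made-up name). Masking n bounds the trip count, so even the inclusive j <= n loop provably terminates, and clang can then apply Gauss's formula; per the comments, __builtin_assume( n < 1000 ) reportedly does not have the same effect.

#include <cstdint>

// Masking bounds n to at most 0xfff, so the j <= n loop cannot run forever
// and the pattern recognizer is free to replace it with the closed form.
uint64_t sum1_masked( uint64_t n ) {
    n &= 0xfff;
    uint64_t sum = 0;
    for ( uint64_t j=0; j<=n; ++j ) {
        sum += j;
    }
    return sum;
}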
