I've been playing around a little bit with x86-64 assembly trying to learn more about the various SIMD extensions that are available (MMX, SSE, AVX).

In order to see how different C or C++ constructs are translated into machine code by GCC I've been using Compiler Explorer which is a superb tool.

During one of my 'play sessions' I wanted to see how GCC could optimize a simple run-time initialization of an integer array. In this case I tried to write the numbers 0 to 2047 to an array of 2048 unsigned integers.

The code looks as follows:

unsigned int buffer[2048];

void setup()
{
  for (unsigned int i = 0; i < 2048; ++i)
  {
    buffer[i] = i;
  }
}

If I enable optimizations and AVX-512 instructions (-O3 -mavx512f -mtune=intel), GCC 6.3 generates some really clever code :)

setup():
        mov     eax, OFFSET FLAT:buffer
        mov     edx, OFFSET FLAT:buffer+8192
        vmovdqa64       zmm0, ZMMWORD PTR .LC0[rip]
        vmovdqa64       zmm1, ZMMWORD PTR .LC1[rip]
.L2:
        vmovdqa64       ZMMWORD PTR [rax], zmm0
        add     rax, 64
        cmp     rdx, rax
        vpaddd  zmm0, zmm0, zmm1
        jne     .L2
        ret
buffer:
        .zero   8192
.LC0:
        .long   0
        .long   1
        .long   2
        .long   3
        .long   4
        .long   5
        .long   6
        .long   7
        .long   8
        .long   9
        .long   10
        .long   11
        .long   12
        .long   13
        .long   14
        .long   15
.LC1:
        .long   16
        .long   16
        .long   16
        .long   16
        .long   16
        .long   16
        .long   16
        .long   16
        .long   16
        .long   16
        .long   16
        .long   16
        .long   16
        .long   16
        .long   16
        .long   16
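
For readers less familiar with AVX-512, here is a rough scalar C model of the data flow in that loop. It is only a sketch: the names model_buffer, setup_model and LANES are made up for illustration, and the two small arrays merely stand in for the zmm0 and zmm1 registers. It is not what GCC emits.

```c
#include <string.h>

#define LANES      16      /* 512 bits / 32-bit elements */
#define BUFFER_LEN 2048

unsigned int model_buffer[BUFFER_LEN];

/* Scalar model of the vectorized loop: one 64-byte store per iteration. */
void setup_model(void)
{
    unsigned int zmm0[LANES];   /* stands in for zmm0, loaded from .LC0 */
    unsigned int zmm1[LANES];   /* stands in for zmm1, loaded from .LC1 */

    for (unsigned int lane = 0; lane < LANES; ++lane)
    {
        zmm0[lane] = lane;      /* 0, 1, 2, ..., 15 */
        zmm1[lane] = LANES;     /* 16 in every lane */
    }

    /* 2048 / 16 = 128 iterations, matching 8192 bytes / 64 bytes per store. */
    for (unsigned int *p = model_buffer; p != model_buffer + BUFFER_LEN; p += LANES)
    {
        memcpy(p, zmm0, sizeof zmm0);               /* vmovdqa64 [rax], zmm0 */
        for (unsigned int lane = 0; lane < LANES; ++lane)
        {
            zmm0[lane] += zmm1[lane];               /* vpaddd zmm0, zmm0, zmm1 */
        }
    }
}
```

Calling setup_model() once leaves model_buffer[i] == i for every index, which is the same result as the original scalar loop.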

However, when I tested what would be generated if the same code was compiled as C (by adding the flag -x c), I was really surprised.

I expected similar, if not identical, results but the C-compiler seems to generate much more complicated and presumably also much slower machine code. The resulting assembly is too large to paste here in full, but it can be viewed at godbolt.org by following this link.

A snippet of the generated code, lines 58 to 83, can be seen below:

.L2:
        vpbroadcastd    zmm0, r8d
        lea     rsi, buffer[0+rcx*4]
        vmovdqa64       zmm1, ZMMWORD PTR .LC1[rip]
        vpaddd  zmm0, zmm0, ZMMWORD PTR .LC0[rip]
        xor     ecx, ecx
.L4:
        add     ecx, 1
        add     rsi, 64
        vmovdqa64       ZMMWORD PTR [rsi-64], zmm0
        cmp     ecx, edi
        vpaddd  zmm0, zmm0, zmm1
        jb      .L4
        sub     edx, r10d
        cmp     r9d, r10d
        lea     eax, [r8+r10]
        je      .L1
        mov     ecx, eax
        cmp     edx, 1
        mov     DWORD PTR buffer[0+rcx*4], eax
        lea     ecx, [rax+1]
        je      .L1
        mov     esi, ecx
        cmp     edx, 2
        mov     DWORD PTR buffer[0+rsi*4], ecx
        lea     ecx, [rax+2]

As you can see, this code has a lot of complicated moves and jumps and in general feels like a very complex way of performing a simple array initialization.

Why is there such a big difference in the generated code?

Is the GCC C++-compiler better in general at optimizing code that is valid in both C and C++ when compared to the C-compiler?

  • Additional data point: using static unsigned int buffer[2048]; makes the C code similar too. You will have to actually use the buffer so that it does not get totally eliminated, though. It looks like an alignment issue: the extra code is there to handle misalignment. Commented Dec 22, 2016 at 23:19
  • @Olaf maybe you could fill us in on the difference between semantics in C and C++ for this piece of code. Commented Dec 22, 2016 at 23:23
  • @Jester Pro tip for godbolt: putting void g(void *); g(buffer); will prevent buffer from being optimized out. Commented Dec 22, 2016 at 23:24
  • @Olaf Why should it not? If you have some specific insight into how and why gcc does what it does in this case, add an answer, as that is basically what the OP asks. Commented Dec 22, 2016 at 23:26
  • Putting unsigned int buffer[2048] = { 0 }; also generates the simpler code. Maybe Olaf is actually onto something: in C, unsigned int buffer[2048] is a tentative definition, something C++ doesn't have. This does not actually affect the observable behaviour of the program, but obviously it has some influence on the GCC code generation. Commented Dec 22, 2016 at 23:26

1 Answer

The extra code is for handling misalignment, because the instruction used, vmovdqa64, requires its memory operand to be 64-byte aligned.
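
To illustrate the general idea, here is a scalar C sketch of the usual "peeling" shape a vectorizer falls back to when it cannot assume alignment: a scalar prologue up to the first 64-byte boundary, an aligned main body, and a scalar epilogue for the leftovers. The function name fill_with_peeling and the constants are hypothetical, and this is not a transcription of GCC's actual output, only the overall structure.

```c
#include <stdint.h>

#define LANES        16u   /* 32-bit elements per 512-bit vector */
#define VECTOR_BYTES 64u   /* alignment required by vmovdqa64 */

/* Writes 0..n-1 into dst without assuming anything beyond 4-byte alignment. */
void fill_with_peeling(unsigned int *dst, unsigned int n)
{
    unsigned int i = 0;

    /* Prologue: scalar stores until dst + i sits on a 64-byte boundary. */
    while (i < n && ((uintptr_t)(dst + i) % VECTOR_BYTES) != 0)
    {
        dst[i] = i;
        ++i;
    }

    /* Main body: whole blocks of 16 elements. In the real code each block
       is a single aligned 64-byte vector store. */
    for (; n - i >= LANES; i += LANES)
    {
        for (unsigned int lane = 0; lane < LANES; ++lane)
        {
            dst[i + lane] = i + lane;
        }
    }

    /* Epilogue: whatever did not fill a complete block. */
    for (; i < n; ++i)
    {
        dst[i] = i;
    }
}
```

When the compiler knows the array starts on a 64-byte boundary and its length is a multiple of 16, both the prologue and the epilogue disappear, which is why the C++ output is so much shorter.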

My testing shows that, although the standard does not, gcc in C mode allows a definition in another module to override the one here. That definition might only satisfy the basic alignment requirement (4 bytes), so the compiler can't rely on the bigger alignment. Technically, gcc emits a .comm assembly directive for this tentative definition, while an external definition with an initializer becomes a normal symbol in the .data section. During linking, that normal symbol takes precedence over the .comm one.
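
A small sketch of what "tentative definition" means in practice. The function buffer_bytes and the file name other.c mentioned in the comments are hypothetical, purely for illustration.

```c
#include <stddef.h>

/* A file-scope declaration with no initializer and no storage class is a
   tentative definition in C. Repeating it in the same translation unit is
   legal C; in C++ the second line would be a redefinition error. */
unsigned int buffer[2048];
unsigned int buffer[2048];

/* With no initializer anywhere in this file, the object is zero-initialized.
   With -fcommon, gcc emits it as a .comm symbol, so a hypothetical other.c
   containing
       unsigned int buffer[2048] = { 1 };
   would win at link time and bring its own (possibly smaller) alignment.
   With -fno-common the same pair of files fails to link with a
   multiple-definition error instead. */
size_t buffer_bytes(void)
{
    return sizeof buffer;
}
```

This is why the compiler treats the alignment of buffer as unknown in C mode: under the common-symbol model, the storage that ends up being used may not be the storage this file asked for.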

Note that if you change the program to use extern unsigned int buffer[2048]; then even the C++ version will have the added code. Conversely, making it static unsigned int buffer[2048]; will turn the C version into the optimized one.
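
A minimal sketch of two source-level ways to let the C compiler assume a suitable alignment, based on the suggestions in this thread. The names static_buffer, aligned_buffer, read_static and is_aligned_64 are made up for the example, and __attribute__((aligned(64))) is a GCC/Clang extension, not standard C.

```c
#include <stddef.h>
#include <stdint.h>

#define BUFFER_LEN 2048

/* Option 1: internal linkage. No other module can supply the storage, so
   the compiler is free to pick (and rely on) whatever alignment it likes. */
static unsigned int static_buffer[BUFFER_LEN];

/* Option 2: keep external linkage but state the alignment explicitly. */
unsigned int aligned_buffer[BUFFER_LEN] __attribute__((aligned(64)));

void setup_static(void)
{
    for (unsigned int i = 0; i < BUFFER_LEN; ++i)
    {
        static_buffer[i] = i;
    }
}

void setup_aligned(void)
{
    for (unsigned int i = 0; i < BUFFER_LEN; ++i)
    {
        aligned_buffer[i] = i;
    }
}

/* Accessor so the static buffer is observable (and not optimized away). */
unsigned int read_static(size_t index)
{
    return static_buffer[index];
}

/* Returns 1 if p is on a 64-byte boundary, 0 otherwise. */
int is_aligned_64(const void *p)
{
    return ((uintptr_t)p % 64u) == 0;
}
```

A third option needs no source change at all: compile with -fno-common so the tentative definition is no longer emitted as a common symbol. As far as I know, GCC 10 and later make -fno-common the default, so newer compilers may not show the difference described in the question.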

14 Comments

Note: when compiling the C version with gcc, you can add the -fno-common compiler flag, or annotate the buffer variable with __attribute__((aligned(64))), and it will generate code similar to the C++ version.
@M.M actually this is a declaration, and it has external linkage in C but internal in C++. Having internal linkage means it must be defined in this module and the compiler will do that. For C, it may very well be defined in another module so the compiler might have to work with that. Of course adding an initializer will turn it into a definition and you can't have more of those in C either so the compiler can generate the optimized code.
@Jester The tentative definition may only be redefined within the same translation unit. A tentative definition is a definition, not "just a declaration" in any circumstance. Your point is still wrong in Standard C.
The longer I look at it, the more it seems like you are correct. However, intentionally or not, gcc allows the tentative definition to be overridden by an initialized definition from a different module, and that definition brings its own alignment with it. Given that this is undefined behavior, it may change in later versions, but for the version in the question this seems to be the case.
Allowing definitions with no initializer in multiple translation units is a Unix extension (which is an odd concept, since Unix had C for nearly 20 years before ANSI came along) which (I believe is still) used in GNU programs. I find it highly unlikely that GCC would actually drop support for it, even the ANSI standard technically allows it.