I've been playing around a little bit with x86-64 assembly trying to learn more about the various SIMD extensions that are available (MMX, SSE, AVX).
To see how different C or C++ constructs are translated into machine code by GCC, I've been using Compiler Explorer, which is a superb tool.
During one of my 'play sessions' I wanted to see how GCC could optimize a simple run-time initialization of an integer array. In this case I tried to write the numbers 0 to 2047 to an array of 2048 unsigned integers.
The code looks as follows:
unsigned int buffer[2048];

void setup()
{
    for (unsigned int i = 0; i < 2048; ++i)
    {
        buffer[i] = i;
    }
}
If I enable optimizations and AVX-512 instructions (-O3 -mavx512f -mtune=intel), GCC 6.3 generates some really clever code :)
setup():
mov eax, OFFSET FLAT:buffer
mov edx, OFFSET FLAT:buffer+8192
vmovdqa64 zmm0, ZMMWORD PTR .LC0[rip]
vmovdqa64 zmm1, ZMMWORD PTR .LC1[rip]
.L2:
vmovdqa64 ZMMWORD PTR [rax], zmm0
add rax, 64
cmp rdx, rax
vpaddd zmm0, zmm0, zmm1
jne .L2
ret
buffer:
.zero 8192
.LC0:
.long 0
.long 1
.long 2
.long 3
.long 4
.long 5
.long 6
.long 7
.long 8
.long 9
.long 10
.long 11
.long 12
.long 13
.long 14
.long 15
.LC1:
.long 16
.long 16
.long 16
.long 16
.long 16
.long 16
.long 16
.long 16
.long 16
.long 16
.long 16
.long 16
.long 16
.long 16
.long 16
.long 16
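To spell out what that loop does, here is a plain C emulation of it. This is only a sketch for illustration, not compiler output: the names emu_buffer, setup_emulated, LANES, v and step are hypothetical, and the two small arrays merely stand in for the registers zmm0 and zmm1.

```c
#include <stddef.h>

#define N 2048
#define LANES 16  /* a 512-bit ZMM register holds sixteen 32-bit integers */

unsigned int emu_buffer[N];  /* hypothetical stand-in for buffer */

void setup_emulated(void)
{
    unsigned int v[LANES];     /* plays the role of zmm0, loaded from .LC0 */
    unsigned int step[LANES];  /* plays the role of zmm1, loaded from .LC1 */

    for (size_t lane = 0; lane < LANES; ++lane) {
        v[lane] = (unsigned int)lane;  /* 0, 1, ..., 15 */
        step[lane] = LANES;            /* 16 in every lane */
    }

    /* Each iteration handles 16 elements (64 bytes), like the .L2 loop. */
    for (size_t base = 0; base < N; base += LANES) {
        for (size_t lane = 0; lane < LANES; ++lane) {
            emu_buffer[base + lane] = v[lane];  /* vmovdqa64 [rax], zmm0 */
        }
        for (size_t lane = 0; lane < LANES; ++lane) {
            v[lane] += step[lane];              /* vpaddd zmm0, zmm0, zmm1 */
        }
    }
}
```

The loop runs 2048 / 16 = 128 times, which matches the 8192-byte range between buffer and buffer+8192 being walked in 64-byte steps.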
However, when I compiled the same code as C by adding the flag -x c, I was really surprised.
I expected similar, if not identical, results, but the C compiler seems to generate much more complicated, and presumably also much slower, machine code. The resulting assembly is too large to paste here in full, but it can be viewed at godbolt.org by following this link.
A snippet of the generated code, lines 58 to 83, can be seen below:
.L2:
        vpbroadcastd    zmm0, r8d
        lea     rsi, buffer[0+rcx*4]
        vmovdqa64       zmm1, ZMMWORD PTR .LC1[rip]
        vpaddd  zmm0, zmm0, ZMMWORD PTR .LC0[rip]
        xor     ecx, ecx
.L4:
        add     ecx, 1
        add     rsi, 64
        vmovdqa64       ZMMWORD PTR [rsi-64], zmm0
        cmp     ecx, edi
        vpaddd  zmm0, zmm0, zmm1
        jb      .L4
        sub     edx, r10d
        cmp     r9d, r10d
        lea     eax, [r8+r10]
        je      .L1
        mov     ecx, eax
        cmp     edx, 1
        mov     DWORD PTR buffer[0+rcx*4], eax
        lea     ecx, [rax+1]
        je      .L1
        mov     esi, ecx
        cmp     edx, 2
        mov     DWORD PTR buffer[0+rsi*4], ecx
        lea     ecx, [rax+2]
As you can see, this code has a lot of complicated moves and jumps and in general feels like a very complex way of performing a simple array initialization.
Why is there such a big difference in the generated code?
Is GCC's C++ compiler generally better than its C compiler at optimizing code that is valid in both languages?
static unsigned int buffer[2048]; makes the C code similar too. You will have to actually use the buffer so that it does not get totally eliminated though.
Looks like it's an alignment issue, the extra code is there to handle misalignment.
void g(void *); g(buffer); will prevent buffer being optimized out.
unsigned int buffer[2048] = { 0 }; also generates the simpler code.
Maybe Olaf is actually onto something: in C, unsigned int buffer[2048] is a tentative definition, something C++ doesn't have. This does not actually affect the observable behaviour of the program, but obviously it has some influence on the GCC code generation.
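The three ways of defining the array mentioned in these comments can be put side by side in one C file and compared in Compiler Explorer. This is a minimal sketch: the names buf_tentative, buf_initialized, buf_static and the setup_* functions are hypothetical, and g is defined here as a no-op only so the file links on its own. In a real experiment g would live in another translation unit, as in the comment above.

```c
#define N 2048

/* Stand-in for the external g() from the comments. Defined here as a
   no-op (an assumption for this sketch) so the file is self-contained. */
void g(void *p) { (void)p; }

/* 1. No initializer, external linkage: a tentative definition in C. */
unsigned int buf_tentative[N];

/* 2. Explicit initializer: an ordinary definition, not a tentative one. */
unsigned int buf_initialized[N] = { 0 };

/* 3. Internal linkage: only this translation unit can see the array. */
static unsigned int buf_static[N];

void setup_tentative(void)
{
    for (unsigned int i = 0; i < N; ++i) {
        buf_tentative[i] = i;
    }
}

void setup_initialized(void)
{
    for (unsigned int i = 0; i < N; ++i) {
        buf_initialized[i] = i;
    }
}

void setup_static(void)
{
    for (unsigned int i = 0; i < N; ++i) {
        buf_static[i] = i;
    }
    g(buf_static);  /* use the array so it is not eliminated entirely */
}
```

Compiling this with -x c -O3 -mavx512f -mtune=intel and comparing the three functions should show whether only setup_tentative gets the extra misalignment handling, as the comments suggest. A further experiment, which the comments do not mention and which is my assumption, would be to add GCC's -fno-common flag and check whether the tentative definition then compiles like the other two.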