
Consider the following declaration of local variables:

bool a{false};
bool b{false};
bool c{false};
bool d{false};
bool e{false};
bool f{false};
bool g{false};
bool h{false};

On x86-64, I'd expect the optimizer to reduce the initialization of these variables to something like mov qword ptr [rsp], 0. Instead, every compiler I've been able to try, at every optimization level, emits some form of:

mov     byte ptr [rsp + 7], 0
mov     byte ptr [rsp + 6], 0
mov     byte ptr [rsp + 5], 0
mov     byte ptr [rsp + 4], 0
mov     byte ptr [rsp + 3], 0
mov     byte ptr [rsp + 2], 0
mov     byte ptr [rsp + 1], 0
mov     byte ptr [rsp], 0

That seems like a waste of CPU cycles. Using copy-initialization or value-initialization, or replacing the braces with parentheses, makes no difference.

But wait, that's not all. Suppose that I have this instead:

struct
{
    bool a{false};
    bool b{false};
    bool c{false};
    bool d{false};
    bool e{false};
    bool f{false};
    bool g{false};
    bool h{false};
} bools;

Then the initialization of bools generates exactly what I'd expect: mov qword ptr [rsp], 0. What gives?

You can try the code above yourself in this Compiler Explorer link.

The behavior of the different compilers is so consistent that I am forced to think there is some reason for the above inefficiency, but I have not been able to find it. Do you know why?

  • This initialisation syntax is C++, not C, so this is irrelevant for C. Commented Aug 11, 2020 at 9:25
  • If you were to pull 10^6 lines of C++ from a repo, how often would this 'pattern' occur? I suspect the answer is 'not often', and that the compiler writers (yes, all of them) have spent their optimisation efforts more productively on more frequently occurring patterns. If you like, this is an economic decision, not a purely technical one. But this is just conjecture. Commented Aug 11, 2020 at 9:33
  • @AndreasH. The stack should be 16-byte aligned, but you can place a void* X; variable before the booleans (to make sure they are 8-byte aligned) and it does not change the generated code one bit. Commented Aug 11, 2020 at 9:40
  • The answer is presumably just "oversight": because your foo takes references to its parameters, the compiler has to make sure each variable has an address, which is ever so slightly different from the struct case, where this optimization is more common. You could raise a bug with GCC and Clang to see what they have to say about it, but probably they just never needed to. Commented Aug 11, 2020 at 9:48
  • It's more interesting with chars for GCC: gcc.godbolt.org/z/3rEPEo Commented Aug 11, 2020 at 9:51

1 Answer


Compilers are dumb; this is a missed optimization. mov qword ptr [rsp], 0 would be optimal: store forwarding from a qword store to a byte reload of any individual byte is efficient on modern CPUs. (https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/)

(Or even better, push 0 instead of sub rsp, 8 + mov, which is also a missed optimization: compilers don't bother looking for cases where that's possible.)


Presumably the optimization pass that looks for store merging runs before the compiler nails down the locations of locals in the stack frame relative to each other (or even before it decides which locals can be kept in registers and which need memory addresses at all).

Store merging (aka coalescing) was only reintroduced in GCC 8, IIRC, after being dropped as a regression in the transition from GCC 2.95 to GCC 3. (I think other optimizations, like assuming no strict-aliasing violations to keep more vars in registers more of the time, were considered more useful.) So it was missing for decades.

From one POV, you could say consider yourself lucky you're getting any store merging at all (with struct members and array elements, which are known early to be adjacent). Of course, from another POV, compilers ideally should make good asm. But in practice missed optimizations are common. Fortunately we have beefy CPUs with wide superscalar out-of-order execution that usually chew through this crap quickly enough to still see upcoming cache-miss loads and stores, so wasted instructions sometimes have time to execute in the shadow of other bottlenecks. That's not always true, though, and clogging up space in the out-of-order execution window is never a good thing.

Related: In x86-64 asm: is there a way of optimising two adjacent 32-bit stores / writes to memory if the source operands are two immediate values? covers the general case for constants other than 0, re: what the optimal asm would be. (The difference between array vs. separate locals was only discussed in comments there.)


3 Comments

How about and qword [address], 0 here? I am not sure whether it applies to qword stores in 64-bit mode. Under 8086 compatibility I prefer and word [address], 0 to mov word [address], 0 because the and optimises to a sign-extended 8-bit immediate, so it takes up one byte less. The disadvantage is presumably that the RMW of and is slower than the write-only mov.
mov qword [rsi], 0 results in 48C70600000000 (7 bytes) whereas and qword [rsi], 0 is 48832600 (4 bytes).
@ecm: I was only considering optimizing for speed, like -O3 code-gen, not -Os or clang -Oz code-gen. At -Os, you'd consider something like xor eax,eax / mov [rdi], rax (2 + 3 = 5 bytes): still no false dependency on the old contents, but one extra front-end uop. (xor-zeroing is as cheap as a NOP on Intel CPUs.) At -Oz you'd go full code-golf and yes, and qword [rsi], 0 is the smallest. Of course, push 0 is only 2 bytes, so a smarter compiler that can initialize + reserve space with push gets to win big, also saving a dummy push or sub rsp,8.
