Why does no compiler appear able to optimize this code?

Question

Consider the following C code (assuming an 80-bit long double; I do know of memcmp, this is just an experiment):

enum { sizeOfFloat80=10 }; // NOTE: sizeof(long double) != sizeOfFloat80
_Bool sameBits1(long double x, long double y)
{
    for(int i=0;i<sizeOfFloat80;++i)
        if(((char*)&x)[i]!=((char*)&y)[i])
            return 0;
    return 1;
}

All compilers I checked (gcc, clang, icc on gcc.godbolt.org) generate similar code. Here's an example for gcc with options -O3 -std=c11 -fomit-frame-pointer -m32:

sameBits1:
        movzx   eax, BYTE PTR [esp+16]
        cmp     BYTE PTR [esp+4], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+17]
        cmp     BYTE PTR [esp+5], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+18]
        cmp     BYTE PTR [esp+6], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+19]
        cmp     BYTE PTR [esp+7], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+20]
        cmp     BYTE PTR [esp+8], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+21]
        cmp     BYTE PTR [esp+9], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+22]
        cmp     BYTE PTR [esp+10], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+23]
        cmp     BYTE PTR [esp+11], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+24]
        cmp     BYTE PTR [esp+12], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+25]
        cmp     BYTE PTR [esp+13], al
        sete    al
        ret
.L11:
        xor     eax, eax
        ret

This looks ugly, has a branch on every byte, and in fact doesn't seem to have been optimized at all (although at least the loop is unrolled). It's easy to see, though, that this could be optimized into code equivalent to the following (and, in general, larger data could be compared with larger strides):

#include <string.h>
_Bool sameBits2(long double x, long double y)
{
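    // Low 8 bytes: copy sizeof X bytes (sizeof x is 12 with -m32 and would overflow X).
    long long X=0; memcpy(&X,&x,sizeof X);
    long long Y=0; memcpy(&Y,&y,sizeof Y);
    // The remaining 2 bytes of the 10-byte value start at offset sizeof X.
    short Xhi=0; memcpy(&Xhi,sizeof X+(char*)&x,sizeof Xhi);
    short Yhi=0; memcpy(&Yhi,sizeof Y+(char*)&y,sizeof Yhi);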
    return X==Y && Xhi==Yhi;
}

And this code now gets a much nicer compilation result:

sameBits2:
        sub     esp, 20
        mov     edx, DWORD PTR [esp+36]
        mov     eax, DWORD PTR [esp+40]
        xor     edx, DWORD PTR [esp+24]
        xor     eax, DWORD PTR [esp+28]
        or      edx, eax
        movzx   eax, WORD PTR [esp+48]
        sete    dl
        cmp     WORD PTR [esp+36], ax
        sete    al
        add     esp, 20
        and     eax, edx
        ret

So my question is: why is none of the three compilers able to do this optimization? Is it something very uncommon to see in C code?

Potential for undefined behaviour. long double is not required to have 10 bytes. What do you want to accomplish with that code? It looks obfuscated, like a solution in search of a problem.
– too honest for this site, Commented Apr 2, 2016 at 17:09
@Olaf As I've already said, I assume this size because of the chosen target (Linux x86).
– Ruslan, Commented Apr 2, 2016 at 17:12
I vote up. As I explained below, the question is about the optimization, not the consistency of the code. And the compiler really seems unable to optimize it. Or perhaps the output really is the best optimization, even if it is ugly to look at.
– Frankie_C, Commented Apr 2, 2016 at 17:14
Which is bad style. Never write such code without actual need. Use language features and write fully standard-compliant, portable code.
– too honest for this site, Commented Apr 2, 2016 at 17:14
@Frankie_C: Put that way, it is too broad. Not every question about why some obscure construct is not optimised is useful or on-topic.
– too honest for this site, Commented Apr 2, 2016 at 17:16

AnT stands with Russia · Accepted Answer · 2016-04-02 17:04:44Z

9

Firstly, it is unable to do this optimization because you completely obfuscated the meaning of your code by overloading it with an undue amount of memory reinterpretation. Code like this justly makes the compiler react with "I don't know what on Earth this is, but if that's what you want, that's what you'll get". Why you expect the compiler to even bother to transform one kind of memory reinterpretation into another kind of memory reinterpretation (!) is completely unclear to me.

Secondly, it can probably be made to do it in theory, but it is probably not very high on its list of priorities. Remember that code optimization is usually done by a pattern-matching algorithm, not by some kind of A.I. And this is just not one of the patterns it recognizes.

Most of the time, your manual attempts to perform low-level optimization of the code will defeat the compiler's efforts to do the same. If you want to optimize it yourself, then go all the way. Don't expect to be able to start and then hand it over to the compiler to finish the job for you.

Comparison of two long double values x and y can be done very easily: x == y. If you want a bit-to-bit memory comparison, you will probably make the compiler's job easier by just using memcmp in a compiler that inherently knows what memcmp is (built-in, intrinsic function).
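A minimal sketch of the memcmp variant suggested above. The names sameBits3 and valueBytes are hypothetical, chosen to follow the question's naming. It assumes, as the question does, that on x86 only the first 10 bytes of a long double carry the value and the rest is padding; on targets where long double has a different format it simply compares the whole object. Whether a given compiler expands the memcmp call inline depends on the compiler and its version.

```c
#include <float.h>
#include <stdbool.h>
#include <string.h>

/* Assumption: a 64-bit significand means the x87 80-bit extended format,
   where only the first 10 bytes hold the value and the rest is padding.
   Otherwise, compare every byte of the object. */
#if LDBL_MANT_DIG == 64
enum { valueBytes = 10 };
#else
enum { valueBytes = sizeof(long double) };
#endif

/* Bit-for-bit comparison of the value bytes (hypothetical name). */
bool sameBits3(long double x, long double y)
{
    return memcmp(&x, &y, valueBytes) == 0;
}
```

Passing the byte count as a compile-time constant gives the compiler the best chance of treating memcmp as a built-in rather than emitting a library call.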

edited Apr 2, 2016 at 17:04

answered Apr 2, 2016 at 16:59

4 Comments

Ruslan Over a year ago

The comparison can be done very easily if you want a floating-point comparison. Not as easy if you want bit-for-bit comparison.

Frankie_C Over a year ago

I can understand your point of view, but the OP's question is not about the consistency of the code. It is about what prevents the compiler from optimizing code that is perfectly legal, even if it looks ugly. Moreover, Ruslan's point about bit comparison makes some sense (simple comparison of floating-point values is always a problem). I think the answer is simply that the unrolled for loop is the fastest code. Anyway, PellesC doesn't unroll the loop and produces code very close to the requested one.

Frankie_C Over a year ago

Good point about memcmp, even if the core of the question remains the strange optimization.

Ruslan Over a year ago

About memcmp: I noted in the question that I do know of it and deliberately chose not to use it for the sake of the experiment. Also, none of the three compilers I tested seems able to inline memcmp (although icc calls _intel_fast_memcmp there).
