Comparing optimized builds with switch case and polymorphism

Question

I have the need to performance test two solutions - one that uses polymorphism to execute switch on type and one that uses a switch case to select which of some functions to execute. I really need to optimize this code. I wrote the following test case (You can simply copy paste the code, compile it with g++ -std=c++14 -O3 and run it with echo 1 | ./a.out!) The code is really simple if you read it!

#include <iostream>
#include <chrono>
#include <functional>
#include <array>
#include <cassert>
#include <vector>
#include <memory>
using namespace std;

struct profiler
{
    std::string name;
    std::chrono::high_resolution_clock::time_point p;
    profiler(std::string const &n) :
        name(n), p(std::chrono::high_resolution_clock::now()) { }
    ~profiler()
    {
        using dura = std::chrono::duration<double>;
        auto d = std::chrono::high_resolution_clock::now() - p;
        std::cout << name << ": "
            << std::chrono::duration_cast<dura>(d).count()
            << std::endl;
    }
};
#define PROFILE_BLOCK(pbn) profiler _pfinstance(pbn)

class Base {
public:
    virtual int increment(int in) {
        return in + 2;
    }
};
class Derived : public Base {
public:
    int increment(int in) override {
        return ++in;
    }
};

int increment_one(int in) {
    return in + 2;
}
int increment_two(int in) {
    return ++in;
}
int increment_three(int in) {
    return in + 4;
}
int increment_four(int in) {
    return in + 2;
}

static constexpr unsigned long long NUMBER_LOOP{5000000000};
int main() {

    int which_function;
    cin >> which_function;

    {
        PROFILE_BLOCK("nothing");
    }

    {
        PROFILE_BLOCK("switch case");
        auto counter = 0;
        for (unsigned long long i  = 0; i < NUMBER_LOOP; ++i) {
            switch(which_function) {
            case 0:
                counter = increment_one(counter);
                break;
            case 1:
                counter = increment_two(counter);
                break;
            case 2:
                counter = increment_three(counter);
                break;
            case 3:
                counter = increment_four(counter);
                break;
            default:
                assert(false);
                break;
            }
        }
        cout << counter << endl;
    }

    {
        PROFILE_BLOCK("polymorphism");
        auto counter = 0;
        std::unique_ptr<Base> ptr_base{new Derived()};
        for (unsigned long long i = 0; i < NUMBER_LOOP; ++i) {
            counter = ptr_base->increment(counter);
        }
    }

    return 0;
}

The output I get when I build with g++ -std=c++14 -O3 and run with echo 1 | ./a.out is

nothing: 1.167e-06
705032704
switch case: 4.089e-06
polymorphism: 9.299

I am failing to understand what exactly is causing the switch-case to be almost as fast as the nothing case. Is this because of inlining? Is it because the compiler precomputes the values for each input scenario and puts them in a lookup table? What causes the switch-case to be so fast?

And how can I go about writing a more fair performance test for this scenario? In general I never understand whether the code is fast because of the straight unoptimized translation between the C++ code and assembly or whether its the compiler precomputing a value and completely skipping compilation and producing "no-op style" code.

Note the profiler struct has been copied straight off another SO answer and is not that relevant to the question other than the fact that it measures time

Note As pointed out in the comments below by @dau_sama running the same test on a linux box with gcc instead of clang results in the switch case taking much longer (3.34 in this case) but still much lesser than the polymorphism case.

a switch case is just a condition to use to execute a path or the other. Modern CPUs are very good at pipelining and branch prediction. — dau_sama
– dau_sama, Commented Aug 18, 2016 at 5:06
weird, I get different results from you: nothing: 4.6e-08 switch case: 3.781 polymorphism: 1.57663 — dau_sama
– dau_sama, Commented Aug 18, 2016 at 5:12
well if you look at the results with volatile, you can see that the switch case performs better. Really you have to think about data instead. For every polymorphic object you have an extra pointer at the start of the structure. That and data cache misses in real life application will impact your code more, rendering this simple test useless :-) — dau_sama
– dau_sama, Commented Aug 18, 2016 at 5:31
@Curious Your code is totally biased - You retrieve which_function from i/o, so the compiler cannot optimize the switch away, but you do not take that into account for the polymorphic case, so basically the compiler just remove the dynamic code and simply always call Derived::increment (look at the generated ASM). If you really want to compare both cases, you should have 4 derived classes that match the function, and instantiate one of the four depending on which_function. — Holt
– Holt, Commented Aug 18, 2016 at 6:10

Chris Beck · Accepted Answer · 2016-08-19 02:01:05Z

The problem with your code is, when you do benchmarks like this, to get meaningful results, you can't simply use a for loop and a large number. When you compile with -O3 optimizations, the compiler is permitted to hoist computations out of the loop, perform loop unrolling and things like that, and compute at compile-time the results and hard code them into the binary. Since under the "as-if" rule you can't tell the difference. That makes it hard to benchmark tiny bits of code like this, but it's also the optimizers job to make the code as fast as possible. If the optimizer can see that you're just doing the same thing over and over again, it can potentially fold all the computations together and defeat the benchmark mechanism.

To fix it, you basically need to obfuscate certain parts of the benchmark loop and benchmark framework, so that the compiler is afraid to unroll the loops or otherwise try to analyze across what are supposed to be independent runs of the code of the under test.

In my modified version of your code, I used two bits of code from the google benchmarks library. The best way to understand what is happening here is, watch a great talk by Chandler Carruth which was at CppNow 2015. https://www.youtube.com/watch?v=nXaxk27zwlk

In a nutshell, what is added are two inline assembly directives, "DoNotOptimize" and "ClobberMemory". These are empty blocks of assembly and lead to no actual instructions in the compiled code, but they are marked as asm volatile, which informs the optimizer that they have unknowable side-effects and that it shouldn't try to analyze the assembly itself. The "memory" directive means that they potentially read / write to all memory addresses. Any variable that is marked "DoNotOptimize" is considered to be "known" to this assembly, and so when either of these functions is called, that variable is effectively "scrambled" from the optimizer's reasoning -- even though these are empty collections of instructions, it is required to assume that the value could have changed in an unknowable way after these functions are called, so loop unrolling and other kinds of optimizations become unsound.

Here's my modified version of your code and ouptut:

#include <iostream>
#include <chrono>
#include <functional>
#include <array>
#include <cassert>
#include <vector>
#include <memory>
using namespace std;

// From google benchmarks framework
// See also Chandler Carruth's talk on microoptimizations and benchmarking
// https://www.youtube.com/watch?v=nXaxk27zwlk
namespace bench {

#if defined(__GNUC__)
#define BENCHMARK_ALWAYS_INLINE __attribute__((always_inline))
#else
#define BENCHMARK_ALWAYS_INLINE
#endif

template <class Tp>
inline BENCHMARK_ALWAYS_INLINE void
DoNotOptimize(Tp const & value) {
  asm volatile("" : : "g"(value) : "memory");
}

inline BENCHMARK_ALWAYS_INLINE void
ClobberMemory() {
  asm volatile("" : : : "memory");
}

} // end namespace bench

struct profiler
{
    std::string name;
    std::chrono::high_resolution_clock::time_point p;
    profiler(std::string const &n) :
        name(n), p(std::chrono::high_resolution_clock::now()) { }
    ~profiler()
    {
        using dura = std::chrono::duration<double>;
        auto d = std::chrono::high_resolution_clock::now() - p;
        std::cout << name << ": "
            << std::chrono::duration_cast<dura>(d).count()
            << std::endl;
    }
};
#define PROFILE_BLOCK(pbn) profiler _pfinstance(pbn)

class Base {
public:
    virtual int increment(int in) {
        return in + 2;
    }
};
class Derived : public Base {
public:
    int increment(int in) override {
        return ++in;
    }
};

int increment_one(int in) {
    return in + 2;
}
int increment_two(int in) {
    return ++in;
}
int increment_three(int in) {
    return in + 4;
}
int increment_four(int in) {
    return in + 2;
}

static constexpr unsigned long long NUMBER_LOOP{5000000000};
int main() {

    int which_function;
    cin >> which_function;

    {
        PROFILE_BLOCK("nothing");
    }

    {
        PROFILE_BLOCK("switch case");
        auto counter = 0;
        bench::DoNotOptimize(counter);
        for (unsigned long long i  = 0; i < NUMBER_LOOP; ++i) {
            bench::DoNotOptimize(i);
            switch(which_function) {
            case 0:
                counter = increment_one(counter);
                break;
            case 1:
                counter = increment_two(counter);
                break;
            case 2:
                counter = increment_three(counter);
                break;
            case 3:
                counter = increment_four(counter);
                break;
            default:
                assert(false);
                break;
            }
            bench::ClobberMemory();
        }
        cout << counter << endl;
    }

    {
        PROFILE_BLOCK("polymorphism");
        auto counter = 0;
        bench::DoNotOptimize(counter);
        std::unique_ptr<Base> ptr_base{new Derived()};
        for (unsigned long long i = 0; i < NUMBER_LOOP; ++i) {
            bench::DoNotOptimize(i);
            counter = ptr_base->increment(counter);
            bench::ClobberMemory();
        }
    }

    return 0;
}

Here's what I get when I run it:

$ g++ -std=c++14 main.cpp
$ echo 1 |./a.out
nothing: 3.864e-06
705032704
switch case: 20.385
polymorphism: 91.0152
$ g++ -std=c++14 -O3 main.cpp
$ echo 1 |./a.out
nothing: 6.74e-07
705032704
switch case: 4.59485
polymorphism: 2.5395

Actually I'm pretty surprised by this, I thought that switch case should always be faster. So maybe the obfuscation instructions need to be adjusted, or maybe I'm just wrong.

To try to understand what the difference is, you can look at the generated assembly. You can do that using perf like Chandler does, or use something like godbolt.

Here's a link to godbolt gcc of your code. I didn't read it all, but one thing that stands out to me is that in this section:

        pushq   %r13
        pushq   %r12
        leaq    16(%rdi), %r12
        pushq   %rbp
        pushq   %rbx
        subq    $24, %rsp
        testq   %rsi, %rsi
        movq    %r12, (%rdi)
        je      .L5
        movq    %rdi, %rbx
        movq    %rsi, %rdi
        movq    %rsi, %r13
        call    strlen
        cmpq    $15, %rax
        movq    %rax, %rbp
        movq    %rax, 8(%rsp)
        ja      .L16
        cmpq    $1, %rax
        je      .L17
        testq   %rax, %rax
        jne     .L18
.L9:
        movq    8(%rsp), %rax
        movq    (%rbx), %rdx
        movq    %rax, 8(%rbx)
        movb    $0, (%rdx,%rax)
        addq    $24, %rsp
        popq    %rbx
        popq    %rbp
        popq    %r12
        popq    %r13
        ret
.L16:
        leaq    8(%rsp), %rsi
        xorl    %edx, %edx
        movq    %rbx, %rdi
        call    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_create(unsigned long&, unsigned long)
        movq    8(%rsp), %rdx
        movq    %rax, (%rbx)
        movq    %rax, %rdi
        movq    %rdx, 16(%rbx)
.L7:
        movq    %rbp, %rdx
        movq    %r13, %rsi
        call    memcpy
        jmp     .L9
.L17:
        movzbl  0(%r13), %eax
        movb    %al, 16(%rbx)
        jmp     .L9
.L5:
        movl    $.LC3, %edi
        call    std::__throw_logic_error(char const*)
.L18:

You have these consecutive jump directives: ja .L16, je .L17, jne .L18. So I think that's your switch statement probably. But when you look at where these statements jump back to, they all jump back to .L9, which doesn't go back through the switch statement. So what I suspect the optimizer is doing is hoisting the switch outside of your loop, which allows it to compute the output result of the loop easily for each possible input, and makes the benchmark appear to run in zero time.

On the other hand, when I look at the generated assembly for my version, it still has those same .L16, .L17, and .L18 jumps and they all jump to .L9. So... I'm not sure exactly what it means. But hopefully that will help you to figure it out.

Edit:

Following up on a comment made by @Holt, I adjusted your code to make the virtual case match the switch case better, so that there are four derived classes and an abstract base class. This gives me results more like what I expected. The best explanation I can give is that, maybe when there is only one derived class, the compiler is able to perform "devirtualization" or something. Modern versions of gcc will do link-time-optimizations when -O3 is passed for instance.

Results:

$ g++ -std=c++14 -O3 main.cpp
$ echo 1|./a.out 
nothing: 4.92e-07
705032704
switch case: 4.56484
polymorphism: 9.16065
$ echo 2|./a.out 
nothing: 6.25e-07
-1474836480
switch case: 5.31955
polymorphism: 9.22714
$ echo 3|./a.out 
nothing: 5.42e-07
1410065408
switch case: 3.91608
polymorphism: 9.17771

Adjusted code:

#include <iostream>
#include <chrono>
#include <functional>
#include <array>
#include <cassert>
#include <vector>
#include <memory>
using namespace std;

// From google benchmarks framework
// See also Chandler Carruth's talk on microoptimizations and benchmarking
// https://www.youtube.com/watch?v=nXaxk27zwlk
namespace bench {

#if defined(__GNUC__)
#define BENCHMARK_ALWAYS_INLINE __attribute__((always_inline))
#else
#define BENCHMARK_ALWAYS_INLINE
#endif

template <class Tp>
inline BENCHMARK_ALWAYS_INLINE void
DoNotOptimize(Tp const & value) {
  asm volatile("" : : "g"(value) : "memory");
}

inline BENCHMARK_ALWAYS_INLINE void
ClobberMemory() {
  asm volatile("" : : : "memory");
}

} // end namespace bench

struct profiler
{
    std::string name;
    std::chrono::high_resolution_clock::time_point p;
    profiler(std::string const &n) :
        name(n), p(std::chrono::high_resolution_clock::now()) { }
    ~profiler()
    {
        using dura = std::chrono::duration<double>;
        auto d = std::chrono::high_resolution_clock::now() - p;
        std::cout << name << ": "
            << std::chrono::duration_cast<dura>(d).count()
            << std::endl;
    }
};
#define PROFILE_BLOCK(pbn) profiler _pfinstance(pbn)

int increment_one(int in) {
    return in + 2;
}
int increment_two(int in) {
    return ++in;
}
int increment_three(int in) {
    return in + 4;
}
int increment_four(int in) {
    return in + 2;
}

class Base {
public:
    virtual int increment(int in) = 0;
};

class Derived1 : public Base {
public:
    int increment(int in) override {
        return increment_one(in);
    }
};

class Derived2 : public Base {
public:
    int increment(int in) override {
        return increment_two(in);
    }
};

class Derived3 : public Base {
public:
    int increment(int in) override {
        return increment_three(in);
    }
};

class Derived4 : public Base {
public:
    int increment(int in) override {
        return increment_four(in);
    }
};


static constexpr unsigned long long NUMBER_LOOP{5000000000};
int main() {

    int which_function;
    cin >> which_function;

    {
        PROFILE_BLOCK("nothing");
    }

    {
        PROFILE_BLOCK("switch case");
        auto counter = 0;
        bench::DoNotOptimize(counter);
        bench::DoNotOptimize(which_function);
        for (unsigned long long i  = 0; i < NUMBER_LOOP; ++i) {
            bench::DoNotOptimize(i);
            switch(which_function) {
            case 0:
                counter = increment_one(counter);
                break;
            case 1:
                counter = increment_two(counter);
                break;
            case 2:
                counter = increment_three(counter);
                break;
            case 3:
                counter = increment_four(counter);
                break;
            default:
                assert(false);
                break;
            }
            bench::ClobberMemory();
        }
        cout << counter << endl;
    }

    {
        PROFILE_BLOCK("polymorphism");
        auto counter = 0;
        bench::DoNotOptimize(counter);
        std::unique_ptr<Base> ptr_base;
        switch(which_function) {
        case 0:
          ptr_base.reset(new Derived1());
          break;
        case 1:
          ptr_base.reset(new Derived2());
          break;
        case 2:
          ptr_base.reset(new Derived3());
          break;
        case 3:
          ptr_base.reset(new Derived4());
          break;
        default:
          assert(false);
          break;
        }
        bench::DoNotOptimize(*ptr_base);
        for (unsigned long long i = 0; i < NUMBER_LOOP; ++i) {
            bench::DoNotOptimize(i);
            counter = ptr_base->increment(counter);
            bench::ClobberMemory();
        }
    }

    return 0;
}

Alan · Accepted Answer · 2016-08-18 09:21:34Z

I get a different result:
1). without optimization

$ g++ -std=c++11 -O0 perf.cpp 
$ ./a.out 
2
nothing: 1.761e-06
18446744072234715136
switch case: 25.1785
polymorphism: 110.119

This result is normal. Calling virtual function must have a searching operation on virtual function table, but calling a non-virtual function has no this searching step.

2). with O3 optimization

$g++ -std=c++11 -O3 perf.cpp 
$ ./a.out 
2
nothing: 1.44e-07
18446744072234715136
switch case: 8.4832
polymorphism: 3.34942

ok, this result is really surprise me, but it is also normal.
The function defined in class declaration will be inlined, and the compiler may have get the virtual function address at compiling.

If you really want to know the details, read the assemble code, maybe you can use clang, read the IR code which is much more readable than assemble code. simply your code, remove the unrelated codes:

class Base {
public:
        virtual int increment(int in) {
        return in + 2;
        }
};

class Derived : public Base {
public:
    int increment(int in) override {
        return ++in;
    }
};

int increment_two(int in) {
    return ++in;
}

int main() {
    int which_function  = 2;
    int NUMBER_LOOP = 1;
    Base* ptr_base{new Derived()};

    for (long i  = 0; i < NUMBER_LOOP; ++i) {
            switch(which_function) {
            case 2:
                increment_two(1);
                break;
            }
    }

    for (long i = 0; i < NUMBER_LOOP; ++i) {
        ptr_base->increment(1);
    }

    return 0;
}

$g++ -std=c++11 -O0=3 code.cpp -S
you can read code.s

using clang:
$ clang -std=c++11 -O3  -S -emit-llvm code.cpp 
here post the clang IR code:

; ModuleID = 'xp.cpp'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: norecurse nounwind readnone uwtable
define i32 @_Z13increment_twoi(i32 %in) #0 {
entry:
  %inc = add nsw i32 %in, 1
  ret i32 %inc
}

; Function Attrs: norecurse uwtable
define i32 @main() #1 {
entry:
  ret i32 0
}

attributes #0 = { norecurse nounwind readnone uwtable "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { norecurse uwtable "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 3.8.0 (tags/RELEASE_380/final)"}

Joris · Accepted Answer · 2016-08-18 08:18:08Z

I don't know much about the intricate details of compiler optimizations, but I could imagine the big difference on your platform is a result of the compiler changing the order of the switch and loop, resulting in code equivalent to:

switch(which_function) {
    case 1: // execute increment_one NUMBER_LOOP times
    case 2: // execute increment_two NUMBER_LOOP times
    ...
}

given that your increment functions are simple, the compiler could then inline those functions:

switch(which_function) {
    case 1: count += 2*NUMBER_LOOP; break;
    case 2: count += NUMBER_LOOP; break;
    ...
}

This is very simple code and could explain the low execution time of the switch case on your platform. Making the count variable volatile would obviously disable this optimisation, as noted in the comments.

Of course, the only way to investigate this would be to look at the compiled binary.

Collectives™ on Stack Overflow

Comparing optimized builds with switch case and polymorphism

3 Answers 3

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related