I have a bunch of functions that complete in a few nanoseconds and I want to measure their execution time to assess their performance.
The basic idea is very simple: take a timestamp, execute the same block of code in a loop many times, take another timestamp, compute the difference between the timestamps, and divide it by the number of iterations.
The problem is that when I put the benchmarking code in the main function, the compiler recognizes that I perform the same calculation every iteration and do nothing with the result, so it turns the loop into a no-op and I get absurdly small numbers.
I have managed to make the loop actually execute by faking a usage of the result; this works in Visual Studio 2022, but I have failed to make it work in Code::Blocks (MSYS2 g++).
This method is difficult to reuse, since whenever I want to benchmark another function I have to copy-paste the same lines of code. So I made it a function, which improves reusability but blows the numbers way up: many of my functions are inlined, and when they are benchmarked through that helper function the inlining is defeated.
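To make the above concrete, the copy-pasted pattern looks roughly like this (a simplified sketch, with std::log2f standing in for the function under test; my full code, including the helper-function version, is in the minimal example below):

#include <chrono>
#include <cmath>
#include <iostream>

float r = 0.0f;   // global sink: assigning the result here is the "fake usage"

int main() {
    constexpr int runs = 1048576;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) {
        r = std::log2f(float(i + 1));   // stand-in for the function being benchmarked
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = end - start;
    std::cout << elapsed.count() / runs << " ns per call\n";
}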
Minimal reproducible example:
#include <array>
#include <bit>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <iomanip>
#include <iostream>
#include <utility>
#include <vector>

using std::array;
using std::chrono::steady_clock;
using std::chrono::duration;
using std::cout;
using std::vector;
using std::function;

// Global sinks used to fake a usage of the results.
double d = 0.0;
float r = 0.0f;

constexpr double DU = 1.0 / (uint64_t(1) << 52);
constexpr float FU = 1.0 / (1 << 23);

// Degree-9 polynomial approximation of log2(m) for m in [1, 2).
constexpr array<double, 10> LOG2_POLY9 = {
    -3.27179702e00,
    7.19195108e00,
    -7.34289702e00,
    5.01474324e00,
    -1.64079955e00,
    -3.25941179e-01,
    5.83100708e-01,
    -2.58134034e-01,
    5.44419681e-02,
    -4.66794032e-03,
};

template <std::size_t N, std::size_t... I>
inline double apply_poly_impl(double m1, const std::array<double, N>& data, std::index_sequence<I...>) {
    double s = 0;
    double m = 1;
    ((s += std::get<I>(data) * m, m *= m1), ...);
    return s;
}

template <std::size_t N>
inline double apply_poly(double m1, const std::array<double, N>& data) {
    return apply_poly_impl(m1, data, std::make_index_sequence<N>{});
}

inline float fast_log2_p9(float f) {
    uint32_t bits = std::bit_cast<uint32_t>(f);
    int e = ((bits >> 23) & 0xff) - 127;
    double m1 = 1 + (bits & 0x7fffff) * FU;
    double s = apply_poly(m1, LOG2_POLY9);
    return e + s;
}

// Reusable benchmark helper: average nanoseconds per call over `runs` iterations.
template <typename T>
double timeit(const function<T(T)>& func, const vector<T>& values, int runs = 1048576) {
    auto start = steady_clock::now();
    size_t len = values.size();
    for (int64_t i = 0; i < runs; i++) {
        func(values[i % len]);
    }
    auto end = steady_clock::now();
    duration<double, std::nano> time = end - start;
    return time.count() / runs;
}

int main()
{
    double r4096 = 1.0 / 4096;
    double n;
    vector<double> random_doubles(256);
    vector<float> random_floats(256);
    for (int j = 0; j < 256; j++) {
        n = rand() % 4096 + (rand() % 4096) * r4096;
        random_doubles[j] = n;
        random_floats[j] = float(n);
    }
    cout << std::setprecision(16);

    // Manual timing loop; the result is written to the global `r` to fake a usage.
    auto start = steady_clock::now();
    for (int64_t i = 0; i < 1048576; i++) {
        r = fast_log2_p9(random_floats[i % 256]);
    }
    auto end = steady_clock::now();
    duration<double, std::nano> time = end - start;
    r = fast_log2_p9(0.7413864135742188f);
    cout << "fast_log_p9: " << time.count() / 1048576 << " nanoseconds\n";
    cout << r << '\n';

    // Same benchmark through the reusable helper.
    cout << timeit<float>(fast_log2_p9, random_floats);
}
Compiled in Visual Studio 2022 with the following flags:
/permissive- /ifcOutput "x64\Release\" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /Ob1 /sdl /Fd"x64\Release\vc143.pdb" /Zc:inline /fp:fast /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /std:c17 /Gd /Oi /MD /std:c++20 /FC /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Ot /Fp"x64\Release\exponentiation.pch" /diagnostics:column
Output:
PS C:\Users\Xeni> C:\Users\Xeni\source\repos\exponentiation\x64\Release\exponentiation.exe
fast_log_p9: 0.9078025817871094 nanoseconds
-0.4317024052143097
15.96231460571289
Compiled with Code::Blocks, commands:
g++.exe -Wall -fexceptions -fomit-frame-pointer -fexpensive-optimizations -flto -O3 -m64 --std=c++20 -march=native -fext-numeric-literals -c D:\MyScript\CodeBlocks\testapp\main.cpp -o obj\Release\main.o
g++.exe -o bin\Release\testapp.exe obj\Release\main.o -O3 -flto -s -static-libstdc++ -static-libgcc -static -m64
Output:
PS C:\Users\Xeni> D:\MyScript\CodeBlocks\testapp\bin\Release\testapp.exe
fast_log_p9: 9.5367431640625e-05 nanoseconds
-0.4317024052143097
7.897567749023438
So how can I measure the execution time of a block of code, with all optimizations on, using a method that is easily reusable?
For what it is worth, I have benchmarked the std::log2f provided by Visual Studio 2022, and mine is clearly faster and just as precise with the /fp:fast flag:
    start = steady_clock::now();
    for (int64_t i = 0; i < 1048576; i++) {
        r = std::log2f(random_floats[i % 256]);
    }
    end = steady_clock::now();
    time = end - start;
    r = std::log2f(0.7413864135742188f);
    cout << "std::log2f: " << time.count() / 1048576 << " nanoseconds\n";
    cout << r << '\n';
fast_log2_p9: 0.9081840515136719 nanoseconds
-0.4317024052143097
std::log2f: 54.67109680175781 nanoseconds
-0.4317024052143097
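A fuller precision check than the single value above would be something along these lines (just a sketch reusing fast_log2_p9 from the minimal example above, not a measurement I am reporting here):

#include <cmath>
#include <cstdio>

// Sweep a range of inputs and report the largest absolute difference
// between fast_log2_p9 (from the minimal example above) and std::log2f.
int main() {
    float max_err = 0.0f;
    for (int i = 1; i < (1 << 20); ++i) {
        float x = i * (16.0f / (1 << 20));   // inputs in (0, 16]
        float err = std::fabs(fast_log2_p9(x) - std::log2f(x));
        if (err > max_err) max_err = err;
    }
    std::printf("max abs difference vs std::log2f: %g\n", max_err);
}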
Additionally, I use Visual Studio 2022 on Windows 10 x64, which should be extremely obvious, and quick-bench.com doesn't use the MSVC compiler. My functions use inline: I have just seen that with the inline keywords present my function clocked 41 nanoseconds, but after I removed the three inline keywords it clocked 29 nanoseconds. The error margin is very high because I have only run it twice.
But clearly the benchmark disregards inline, because removing inline should make the code run much slower, and it doesn't.
And benchmark::DoNotOptimize(r); is used. I don't know its exact purpose, but clearly the optimization isn't enabled, and my question clearly stated that I want to measure the execution time of the function WITH ALL OPTIMIZATIONS ENABLED, because of course I would use the code with all optimizations.
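For context, a quick-bench benchmark of my function looks roughly like this (a sketch of how the benchmark::DoNotOptimize call is used, with fast_log2_p9 taken from the minimal example above; a standalone Google Benchmark build also needs the BENCHMARK_MAIN() line, which quick-bench supplies itself):

#include <benchmark/benchmark.h>

static void BM_fast_log2_p9(benchmark::State& state) {
    float x = 0.7413864135742188f;
    for (auto _ : state) {
        float result = fast_log2_p9(x);
        benchmark::DoNotOptimize(result);   // treats the value as used, so the call is not removed
    }
}
BENCHMARK(BM_fast_log2_p9);

BENCHMARK_MAIN();   // quick-bench provides this part itself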
And while it is true that CPUs don't execute C++ code and C++ functions aren't translated into one contiguous sequence of instructions, I am clearly stating that I want to measure the time a function would typically take to complete: the average across many runs that I should expect.
I have also tried r += and d += instead of plain assignment.