Recently, I had a debate with a colleague about Python vs. C++ performance. Both of us use these languages mostly for linear algebra, so I wrote two scripts: one in Python 3 using NumPy, and one in C++ using Eigen.

The Python 3 NumPy version, matmul_numpy.py:

import numpy as np
import time

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

start = time.time()
c = a * b
end = time.time()
print(end - start)

If I run this script with

python3 matmul_numpy.py

This will return:

0.07 seconds

The C++ Eigen version, matmul_eigen.cpp:


#include <iostream>
#include <Eigen/Dense>
#include <ctime>

int main() {
    clock_t start, end;
    size_t n = 2000;
    Eigen::MatrixXd a = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(n, n);
    start = clock();
    Eigen::MatrixXd c = a * b;
    end = clock();
    std::cout << (double)(end - start) / CLOCKS_PER_SEC << std::endl;
    return 0;
}

I compile it with:

g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -march=native -std=c++17 -o matmul_eigen

This returns (with both -std=c++11 and -std=c++17):

0.35 seconds

This is very odd to me. 1. Why is NumPy so much faster than C++ here? Am I missing any optimization flags?

I thought maybe the Python interpreter was somehow executing the program faster, so I compiled the script to a binary with Cython, following a Stack Overflow thread on the topic.

The compiled Python script was still faster (0.11 seconds). This raised two more questions:

2. Why did the compiled version get slower (0.11 s vs. 0.07 s)? Does the interpreter do any extra optimization?

3. Why is the binary compiled from the Python script (37 kB) smaller than the C++ one (57 kB)?

I would appreciate any help,

thanks

Comments

  • Do these programs do the same thing? Do you get the same result from both programs? Commented Jan 7, 2023 at 0:36
  • Honestly, I don't think we can really answer this without looking at the disassembled output of both, or knowing how each operates internally. Commented Jan 7, 2023 at 0:43
  • "Because I am using random numbers" - OK, then you need to use the same random number generator for both. You can't compare a program using a slow PRNG with one using a fast PRNG. np.random.rand may use a more modern PRNG than what's included in Eigen (or standard C++, for that matter). And make sure you get the same result; otherwise comparing is pointless. Seed them both the same way. Commented Jan 7, 2023 at 0:50
  • Oh, right. OK then, make sure that both programs work on exactly the same data. Same input, same output. Commented Jan 7, 2023 at 0:54
  • NumPy may use a multithreaded implementation of the linear algebra operations, depending on the backend used (see numpy.org/devdocs/reference/global_state.html), while Eigen doesn't with the options you used. Commented Jan 7, 2023 at 0:58

3 Answers


The biggest issue is that you are comparing two completely different things:

  • In NumPy, a*b performs an element-wise multiplication, since a and b are 2D arrays, not matrices; a@b performs a matrix multiplication.
  • In Eigen, a*b performs a matrix multiplication, not an element-wise one (see the documentation). This is because a and b are matrices, not just 2D arrays.

The two give completely different results. Moreover, a matrix multiplication runs in O(n**3) time while an element-wise multiplication runs in O(n**2) time. Matrix multiplication kernels are generally highly optimized and compute-bound, and they are often parallelized by BLAS libraries. Element-wise multiplications are memory-bound (especially here, due to page faults). As a result, it is not surprising that the matrix multiplication is slower than the element-wise multiplication, nor that the gap is not huge, since the latter is memory-bound.
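
To make the distinction concrete, here is a minimal Eigen sketch contrasting the two products (illustrative only; Eigen's cwiseProduct is the element-wise counterpart of its operator*, mirroring NumPy's a*b vs. a@b):

#include <iostream>
#include <Eigen/Dense>

int main() {
    Eigen::MatrixXd a = Eigen::MatrixXd::Random(3, 3);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(3, 3);

    // Element-wise (Hadamard) product: what NumPy's a*b computes.
    Eigen::MatrixXd elementwise = a.cwiseProduct(b);

    // Matrix product: what Eigen's a*b and NumPy's a@b compute.
    Eigen::MatrixXd matmul = a * b;

    // The two results differ in general.
    std::cout << (elementwise - matmul).norm() << std::endl;
    return 0;
}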

On my i5-9600KF processor (6 cores), NumPy takes 9 ms to do a*b (sequentially) and 65 ms to do a@b (in parallel, using OpenBLAS).

Note that NumPy element-wise multiplications like this are not parallel (at least not in the standard default implementation of NumPy). NumPy's matrix multiplication uses a BLAS library, generally OpenBLAS by default (this depends on the actual target platform). Eigen should also use a BLAS library, but it might not be the same one as NumPy's.
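
To check which BLAS NumPy was linked against, one can run, for example:

python3 -c "import numpy; numpy.show_config()"

The exact output format varies across NumPy versions, but the linked BLAS/LAPACK libraries are listed there.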

Also note that clock is not a good way to measure parallel code, as it measures CPU time rather than wall-clock time (see this post for more information). std::chrono::steady_clock is generally a better alternative in C++.
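
A minimal sketch of the difference (illustrative; the sleep stands in for any period where wall time and CPU time diverge):

#include <chrono>
#include <ctime>
#include <iostream>
#include <thread>

int main() {
    clock_t c0 = clock();
    auto t0 = std::chrono::steady_clock::now();

    // Sleep for half a second: wall-clock time passes, but almost no CPU time is used.
    std::this_thread::sleep_for(std::chrono::milliseconds(500));

    clock_t c1 = clock();
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "clock() CPU time : " << (double)(c1 - c0) / CLOCKS_PER_SEC << " s\n";
    std::cout << "steady_clock wall: " << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    return 0;
}

Here clock() reports roughly 0 s while steady_clock reports roughly 0.5 s. For a multithreaded matrix product the bias goes the other way: clock() sums the CPU time of all threads, overstating the elapsed time.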


3. Why is the binary compiled from the Python script (37 kB) smaller than the C++ one (57 kB)?

Python is generally compiled to bytecode, which is not native assembly code. C++ is generally compiled to an executable program that contains assembled binary code, additional information used to run the program, as well as meta-information. Bytecode is generally pretty compact because it is higher-level. Native compilers can perform optimizations that make programs bigger, such as loop unrolling and inlining. Such optimizations are not done by CPython (the default Python interpreter); in fact, CPython performs no (advanced) optimizations on the bytecode. Note that you can tell native compilers like GCC to generate smaller (though generally slower) code using flags like -Os and -s.
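
For example, recompiling the Eigen program for size rather than speed (illustrative; the exact size reduction depends on the toolchain):

g++ matmul_eigen.cpp -I/usr/include/eigen3 -Os -s -std=c++17 -o matmul_eigen_small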


Comments

Wow, thanks a lot! Your response was very helpful. I applied the changes you and another gentleman in the comments suggested, and now my compiled Eigen C++ beats the Python version by 10 milliseconds. Also, clock_t was not suitable, as you said; I used std::chrono::steady_clock. However, I still have one question: when I compile the Python code into a binary, it takes longer to execute. What does the Python interpreter do that makes it run faster?
Good to know :). For the compilation, CPython generates a bytecode file the first time so it does not have to parse the script later; the bytecode is cached. Thus, the first run can be a bit slower. This operation is generally pretty fast, automatic, and transparent. Just to be sure I understand your question: how do you "compile the Python code into binary"?
I first generate a C file from the Python script with cython example_file.py --embed. Then I compile it with gcc, linked against the Python static libs and includes: gcc -Os $(python3-config --includes) example_file.c -o output_bin_file $(python3-config --ldflags) -lpython3.6m
Cython should not help much in this case. It generates native code doing what CPython does, but it will not be significantly faster because 1. most of the time should be spent in the NumPy code, which Cython cannot optimize, and 2. most of the overhead of CPython (e.g. dynamic typing, object allocation, reference counting, variable-sized integers, etc.) is still present unless annotations are added to remove it. Cython cannot make NumPy faster because the code of NumPy is already compiled.
It might be slower depending on how the program is linked: if the compiled version uses a slower BLAS implementation, then the code can be significantly slower. Besides this, using Cython should not significantly impact the execution time of NumPy calls (at least not at the granularity of >1 ms).

Based on what I learned from @Jérôme Richard's response and the comments from @user17732522, it seems I made two mistakes in the comparison:

1. I made a mistake defining the multiplication in the Python script: it should be np.matmul(a,b), np.dot(a,b), or a@b, not a*b, which is an element-wise multiplication.
2. I didn't measure the time in the C++ code correctly: clock_t doesn't work for this measurement; std::chrono::steady_clock works better.

After applying these fixes, the C++ Eigen code is about 10 times faster than the Python one.

The updated code for matmul_eigen.cpp:

#include <iostream>
#include <Eigen/Dense>
#include <chrono>

int main() {
    size_t n = 2000;
    Eigen::MatrixXd a = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(n, n);

    auto t1 = std::chrono::steady_clock::now();
    Eigen::MatrixXd c = a * b;
    auto t2 = std::chrono::steady_clock::now();

    std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;
    return 0;
}

To compile it, both the vectorization and multithreading flags should be enabled:

g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -std=c++17 -march=native -fopenmp -o eigen_matmul

To run the code with multiple threads:

OMP_NUM_THREADS=4 ./eigen_matmul

where "4" is the number of CPU(s) that openmp can use, you can see how many you have with:

lscpu | grep "CPU(s):"

This will return 0.104 seconds.
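
Alternatively, the thread count can be set from code with Eigen's API (a small sketch; Eigen::setNbThreads and Eigen::nbThreads live in Eigen/Core and only take effect when the program is compiled with -fopenmp):

#include <iostream>
#include <Eigen/Dense>

int main() {
    Eigen::setNbThreads(4);                        // cap Eigen's parallel kernels at 4 threads
    std::cout << Eigen::nbThreads() << std::endl;  // number of threads Eigen will actually use
    return 0;
}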

The updated python script matmul_numpy.py:

import numpy as np
import time

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

a = np.array(a, dtype=np.float64)
b = np.array(b, dtype=np.float64)

start = time.time()
c = np.dot(a, b)
end = time.time()
print(end - start)

To run the code:

python3 matmul_numpy.py

This will return 1.0531 seconds.

As for the reason why, I think @Jérôme Richard's response is the better reference.

Comments

Since NumPy should call the dgemm BLAS routine internally, and the BLAS should be the same, the performance of Eigen and NumPy should be similar (with the same configuration). The NumPy code may be a bit slower due to NumPy overhead, but I would not expect more than 50% overhead in this case, given the input size and the provided timings. Thus, I advise you to check the NumPy configuration, and more specifically the linked BLAS. See this post for more information (there are many other posts like this).
Note that OMP_NUM_THREADS should be automatically set by the OpenMP runtime to the number of cores (or hardware threads, depending on the runtime), assuming it is not already set. OMP_DISPLAY_ENV=TRUE helps to get more information about the OpenMP runtime configuration. Note that using multiple OpenMP runtimes can sometimes cause performance issues. Also note that threads are sometimes badly bound to cores by default and manual thread binding is required; you can play with OMP_PLACES to check/tune that (as well as OMP_PROC_BIND).
@JérômeRichard By default Eigen uses its own BLAS implementation. If the version used by Python is not optimized for the local CPU, and only Eigen has multithreading activated, then being 10 times faster sounds at least plausible.
@chtz The documentation says that a dgemm call is performed. But, indeed, it looks like Eigen does not use an external BLAS implementation unless EIGEN_USE_BLAS is specified, if my understanding is correct. Still, the default OpenBLAS implementation should be pretty efficient in this case (>70% of the optimal time on most platforms) and parallel by default. A 10x performance difference suggests that something is clearly going wrong. OpenBLAS has thread-mapping issues on some platforms (e.g. IBM machines) that can cause this. A bad configuration is also possible.

You actually did miss one flag if you are using g++: -DNDEBUG, which disables Eigen's runtime assertions. See an example of the generated assembly: https://godbolt.org/z/sxannhGdW
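
For example, adding the flag to the compile command from the question:

g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -march=native -DNDEBUG -std=c++17 -o matmul_eigen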
