I have 3 arrays (NumPy arrays in Python, nested vectors in C++), and I wish to perform the following tensor contraction:
import numpy as np
import time
N_t, N = 400, 400
a = np.random.rand(N_t, 2, 2, N)
b = np.random.rand(2, 2, N)
c = np.random.rand(N_t, 2, 2, N)
start_time = time.time()
d = np.einsum('iacm, cdm, jbdm -> ijabm', a, b, c)
print(time.time() - start_time)
Just for simplicity, the 3 arrays are generated randomly. Python takes around 2 seconds.
Now, in C++, without any optimisation, I can write the following (with the aid of ChatGPT to avoid the labour of typing out some common boilerplate):
#include <iostream>
#include <vector>
#include <chrono>
#include <random>
using namespace std;
// Function to generate a random 4D vector with the given shape
std::vector<std::vector<std::vector<std::vector<double> > > > generateRandom4DVector(int dim1, int dim2, int dim3, int dim4) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<double> dis(-1.0, 1.0); // Random numbers between -1 and 1
    std::vector<std::vector<std::vector<std::vector<double> > > > vec(dim1,
        std::vector<std::vector<std::vector<double> > >(dim2,
            std::vector<std::vector<double> >(dim3, std::vector<double>(dim4))));
    // Populate the vector with random values
    for (int i = 0; i < dim1; ++i) {
        for (int j = 0; j < dim2; ++j) {
            for (int k = 0; k < dim3; ++k) {
                for (int l = 0; l < dim4; ++l) {
                    vec[i][j][k][l] = dis(gen); // Generate random number and assign to vector element
                }
            }
        }
    }
    return vec;
}
int main() {
    int dim1 = 400, dim2 = 2, dim3 = 2, dim4 = 400;
    std::vector<std::vector<std::vector<std::vector<double> > > > x = generateRandom4DVector(dim1, dim2, dim3, dim4);
    // y stands in for b; only the slice y[0] (shape (2, 2, N)) is used below
    std::vector<std::vector<std::vector<std::vector<double> > > > y = generateRandom4DVector(dim1, dim2, dim3, dim4);
    std::vector<std::vector<std::vector<std::vector<double> > > > z = generateRandom4DVector(dim1, dim2, dim3, dim4);
    // Result tensor w[i][j][a][b][m]; it must be double (not int),
    // otherwise every product of values in (-1, 1) is truncated toward zero
    std::vector<std::vector<std::vector<std::vector<std::vector<double> > > > > w(
        dim1, std::vector<std::vector<std::vector<std::vector<double> > > >(
            dim1, std::vector<std::vector<std::vector<double> > >(
                2, std::vector<std::vector<double> >(
                    2, std::vector<double>(dim4)
                )
            )
        ));
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < dim1; i++) {
        for (int j = 0; j < dim1; j++) {
            for (int a = 0; a < 2; a++) {
                for (int b = 0; b < 2; b++) {
                    for (int m = 0; m < dim4; m++) {
                        for (int c = 0; c < 2; c++) {
                            for (int d = 0; d < 2; d++) {
                                w[i][j][a][b][m] += x[i][a][c][m] * y[0][c][d][m] * z[j][b][d][m];
                            }
                        }
                    }
                }
            }
        }
    }
    // Stop measuring time
    auto end = std::chrono::high_resolution_clock::now();
    // Calculate duration
    std::chrono::duration<double> duration = end - start;
    // Output duration in seconds
    std::cout << "Elapsed time: " << duration.count() << " seconds" << std::endl;
    return 0;
}
It takes around 16 seconds, which is very slow.
A very naive improvement would be to parallelize those for loops with multiprocessing, putting the very big sum into a function. This takes advantage of the fact that there is no room left to add multiprocessing around np.einsum, and that pure Python has very slow for loops. Given enough CPUs, I can always exploit this to make my C++ faster than NumPy-boosted Python, since for loops are much faster in pure C++.
I am trying to find cleverer routines for this problem. How can I also make judicious use of C++ libraries to speed up sums of this type?
You can pass optimize=True to np.einsum in Python to make it significantly faster (2x-3x times faster on my machine), i.e. np.einsum('iacm, cdm, jbdm -> ijabm', a, b, c, optimize=True). With this flag, NumPy can split the operation into multiple ones so as to make it faster. In this case, it succeeds in splitting the operation into 2 parts so as to divide the number of flops required by 3. Your naive loops cannot compete with that.
On the C++ side, std::vector<std::vector<std::vector<std::vector<double> > > > is really, really bad, as the data is not stored contiguously. You should clearly not do that: you need to use flattened arrays. This is less convenient but faster. This is the standard in nearly all HPC codes, and NumPy does that too. In your case std::array may be enough, since the shapes appear to be constants.
Also make sure compiler optimizations are enabled (e.g. -O3), which matters a great deal for code full of std::vector accesses. Run an optimized build, and then determine if there is still an issue.
Finally, the loop order matters: the loop over m should be the last (innermost) one, unless the two small loops over c and d are unrolled properly and the generated code is properly optimized (which is not the case here).