I have 3 arrays (NumPy arrays in Python, nested vectors in C++), and I wish to perform the following tensor contraction:
import numpy as np
import time
N_t, N = 400, 400
a = np.random.rand(N_t, 2, 2, N)
b = np.random.rand(2, 2, N)
c = np.random.rand(N_t, 2, 2, N)
start_time = time.time()
d = np.einsum('iacm, cdm, jbdm -> ijabm', a, b, c)
print(time.time() - start_time)
Just for simplicity, the 3 arrays are generated randomly. Python takes around 2 seconds.
Now, in C++, without any optimisation, I can write the following (with the aid of ChatGPT to avoid the labour of typing out some common boilerplate):
#include <iostream>
#include <vector>
#include <chrono>
#include <random>
using namespace std;
// Function to generate a random 4D vector with the given shape
std::vector<std::vector<std::vector<std::vector<double> > > > generateRandom4DVector(int dim1, int dim2, int dim3, int dim4) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<double> dis(-1.0, 1.0); // Random numbers between -1 and 1
    std::vector<std::vector<std::vector<std::vector<double> > > > vec(dim1,
        std::vector<std::vector<std::vector<double> > >(dim2,
            std::vector<std::vector<double> >(dim3, std::vector<double>(dim4))));
    // Populate the vector with random values
    for (int i = 0; i < dim1; ++i) {
        for (int j = 0; j < dim2; ++j) {
            for (int k = 0; k < dim3; ++k) {
                for (int l = 0; l < dim4; ++l) {
                    vec[i][j][k][l] = dis(gen); // Generate random number and assign to vector element
                }
            }
        }
    }
    return vec;
}
int main() {
    int dim1 = 400, dim2 = 2, dim3 = 2, dim4 = 400;
    std::vector<std::vector<std::vector<std::vector<double> > > > x = generateRandom4DVector(dim1, dim2, dim3, dim4);
    // y stands in for b; only the slice y[0] (shape (2, 2, N)) is used below
    std::vector<std::vector<std::vector<std::vector<double> > > > y = generateRandom4DVector(dim1, dim2, dim3, dim4);
    std::vector<std::vector<std::vector<std::vector<double> > > > z = generateRandom4DVector(dim1, dim2, dim3, dim4);
    // Result tensor w[i][j][a][b][m]; it must be double (not int),
    // otherwise every product of values in (-1, 1) is truncated toward zero
    std::vector<std::vector<std::vector<std::vector<std::vector<double> > > > > w(
        dim1, std::vector<std::vector<std::vector<std::vector<double> > > >(
            dim1, std::vector<std::vector<std::vector<double> > >(
                2, std::vector<std::vector<double> >(
                    2, std::vector<double>(dim4)
                )
            )
        ));
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < dim1; i++) {
        for (int j = 0; j < dim1; j++) {
            for (int a = 0; a < 2; a++) {
                for (int b = 0; b < 2; b++) {
                    for (int m = 0; m < dim4; m++) {
                        for (int c = 0; c < 2; c++) {
                            for (int d = 0; d < 2; d++) {
                                w[i][j][a][b][m] += x[i][a][c][m] * y[0][c][d][m] * z[j][b][d][m];
                            }
                        }
                    }
                }
            }
        }
    }
    // Stop measuring time
    auto end = std::chrono::high_resolution_clock::now();
    // Calculate duration
    std::chrono::duration<double> duration = end - start;
    // Output duration in seconds
    std::cout << "Elapsed time: " << duration.count() << " seconds" << std::endl;
    return 0;
}
It takes around 16 seconds, which is very slow.
A very naive improvement would be to parallelize those for loops with multiprocessing, putting the very big sum into a function. This takes advantage of the fact that there is no room left to add multiprocessing around np.einsum, and that pure Python has very slow for loops. Given enough CPUs, I can always exploit this to make my C++ faster than NumPy-boosted Python, since for loops are much faster in pure C++.
I am trying to find cleverer routines for this problem. How can I also make judicious use of C++ libraries to speed up sums of this type?
You can pass optimize=True to np.einsum in Python to make it significantly faster (2x-3x times faster on my machine), i.e. np.einsum('iacm, cdm, jbdm -> ijabm', a, b, c, optimize=True). With this flag, NumPy can split the operation into multiple ones so as to make it faster. In this case, it succeeds in splitting the operation into 2 parts so as to divide the number of flops required by 3. Your naive loops cannot compete with that.
On the C++ side, std::vector<std::vector<std::vector<std::vector<double> > > > is really, really bad, as the data is not stored contiguously. You should clearly not do that: you need to use flattened arrays. This is less convenient but faster. This is the standard in nearly all HPC codes, and NumPy does that too. In your case std::array may be enough, since the shapes appear to be constants.
Also make sure compiler optimizations are enabled (e.g. -O3), which matters a great deal for code full of std::vector accesses. Run an optimized build, and then determine if there is still an issue.
Finally, the loop order matters: the loop over m should be the last (innermost) one, unless the two small loops over c and d are unrolled properly and the generated code is properly optimized (which is not the case here).