With your .transpose((1, 2, 0)) data, the correct form is:
"ijs,sk" # -> ijk
Since for a tensor A and B, we can write:
C_{ijk} = Σ_s A_{ijs} * B_{sk}
If you want to avoid transposing your data beforehand, you can just permute the indices:
"sij,sk" # -> ijk
To verify:
p, q, r = 2466, 2498, 9
a = np.random.randint(255, size=(p, q, r))
b = np.random.randint(255, size=(p, p))
c1 = a.transpose((1, 2, 0)) @ b
c2 = np.einsum("sij,sk", a, b)
>>> np.all(c1 == c2)
True
The amount of multiplications needed to compute this for (p, q, r) shaped data is p * np.prod(c.shape) == p * (q * r * p) == p**2 * q * r. In your case, that is 136_716_549_192 multiplications. You also need approximately the same number of additions, so that gives us somewhere close to 270 billion operations. If you want more speed, you could consider using a GPU for your computations via cupy.
def with_np():
p, q, r = 2466, 2498, 9
a = np.random.randint(255, size=(p, q, r))
b = np.random.randint(255, size=(p, p))
c1 = a.transpose((1, 2, 0)) @ b
c2 = np.einsum("sij,sk", a, b)
def with_cp():
p, q, r = 2466, 2498, 9
a = cp.random.randint(255, size=(p, q, r))
b = cp.random.randint(255, size=(p, p))
c1 = a.transpose((1, 2, 0)) @ b
c2 = cp.einsum("sij,sk", a, b)
>>> timeit(with_np, number=1)
513.066
>>> timeit(with_cp, number=1)
0.197
That's a speedup of 2600, including memory allocation, initialization, and CPU/GPU copy times! (A more realistic benchmark would give an even larger speedup.)
data*correlation_matrix.sum(), assuming youreinsumworks.einsumjust sums all values ofcorrelation_matrixand multipliesdataby the resulting scalar. That probably not what you want.htopto see if your computer has enough RAM to do the operation without SWAPing to secondary memory (hard drive)?