With broadcasting this should work:
np.sum(((x*s)[...,None]*v[:,0], axis=1)
but with your sample dimensions I'm getting a memory error. The 'outer' broadcasted array (n,d,l) shape is too large for my memory.
I can reduce memory usage by iterating on the smaller d dimension:
res = np.zeros((n,l), dtype=x.dtype)
for i in range(d):
res += (x[:,i]*s[:,i])[:,None]*v[:,0]
This tests the same as your h, but I wasn't able to complete time tests. Generally iterating on the smaller dimension is faster.
I may repeat things with small dimensions.
This probably can also be expressed as an einsum problem, though it may not help with these dimensions.
In [1]: n, d, l = 5000, 3, 1000
...:
...: # Two complex matrices and a column vector
...: x = np.random.rand(n, d) + 1j*np.random.rand(n, d)
...: s = np.random.rand(n, d) + 1j*np.random.rand(n, d)
...: v = np.random.rand(l)[:, np.newaxis]
In [2]:
In [2]: h = []
...: for i in range(len(x)):
...: h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
...:
...: h = np.asarray(h)
In [3]: h.shape
Out[3]: (5000, 1000)
In [4]: res = np.zeros((n,l), dtype=x.dtype)
...: for i in range(d):
...: res += (x[:,i]*s[:,i])[:,None]*v[:,0]
...:
In [5]: res.shape
Out[5]: (5000, 1000)
In [6]: np.allclose(res,h)
Out[6]: True
In [7]: %%timeit
...: h = []
...: for i in range(len(x)):
...: h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
...: h = np.asarray(h)
...:
...:
490 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %%timeit
...: res = np.zeros((n,l), dtype=x.dtype)
...: for i in range(d):
...: res += (x[:,i]*s[:,i])[:,None]*v[:,0]
...:
354 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]:
In [9]: np.sum((x*s)[...,None]*v[:,0], axis=1).shape
Out[9]: (5000, 1000)
In [10]: out = np.sum((x*s)[...,None]*v[:,0], axis=1)
In [11]: np.allclose(h,out)
Out[11]: True
In [12]: timeit out = np.sum((x*s)[...,None]*v[:,0], axis=1)
310 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Some time savings, but not big.
And the einsum version:
In [13]: np.einsum('ij,ij,k->ik',x,s,v[:,0]).shape
Out[13]: (5000, 1000)
In [14]: np.allclose(np.einsum('ij,ij,k->ik',x,s,v[:,0]),h)
Out[14]: True
In [15]: timeit np.einsum('ij,ij,k->ik',x,s,v[:,0]).shape
167 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Good time savings. But I don't know how it will scale.
But the einsum made me realize that we can sum on d dimension earlier, before multiplying by v - and gain a lot in time and memory usage:
In [16]: np.allclose(np.sum(x*s, axis=1)[:,None]*v[:,0],h)
Out[16]: True
In [17]: timeit np.sum(x*s, axis=1)[:,None]*v[:,0]
68.4 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@cs95 got there first!
As per @PaulPanzer's comment, the optimize flag helps. It's probably making the same deduction - that we can sum on j early:
In [18]: timeit np.einsum('ij,ij,k->ik',x,s,v[:,0],optimize=True).shape
91.6 ms ± 991 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
h2 = ((x * s).sum(axis=1)[:,None] * v.T); np.allclose(h, h2) >>> Trueapply_along_axisis not a speed tool; just a convenience (in some cases).