Working from your comment:
In [465]: x=np.array([[1,2,3],[4,5,6]])
In [466]: np.einsum('ij,ik', x,x)
Out[466]:
array([[17, 22, 27],
       [22, 29, 36],
       [27, 36, 45]])
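To spell out what `'ij,ik'` computes: the repeated index `i` appears in both operands and, with no explicit output spec, is summed over, so the result is the Gram matrix `out[j,k] = sum_i x[i,j]*x[i,k]`. A quick sketch checking that against an explicit loop:

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])

# 'ij,ik' repeats i in both operands with no output spec, so einsum
# sums over i: out[j, k] = sum_i x[i, j] * x[i, k].
explicit = np.zeros((3, 3), dtype=x.dtype)
for i in range(x.shape[0]):
    explicit += np.outer(x[i], x[i])   # rank-1 update per row

assert np.array_equal(np.einsum('ij,ik', x, x), explicit)
```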
In [467]: x=np.ones((10,3))
In [468]: np.einsum('ij,ik', x,x)
Out[468]:
array([[ 10.,  10.,  10.],
       [ 10.,  10.,  10.],
       [ 10.,  10.,  10.]])
In [471]: x=np.ones((10000000,3))
In [472]: np.einsum('ij,ik', x,x)
Out[472]:
array([[ 10000000.,  10000000.,  10000000.],
       [ 10000000.,  10000000.,  10000000.],
       [ 10000000.,  10000000.,  10000000.]])
If the array becomes too big for memory, you could try chunking it:
In [479]: res = np.zeros((3,3))
In [480]: for i in range(10):
     ...:     res += np.einsum('ij,ik',x,x)
     ...:
In [481]: res
Out[481]:
array([[ 170.,  220.,  270.],
       [ 220.,  290.,  360.],
       [ 270.,  360.,  450.]])
Generally, a few iterations over a large task are faster than pushing against memory limits in a single pass. At some point the memory-management costs exceed the iteration costs.
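The accumulation loop above can be wrapped into a small helper; `chunk_gram` and `n_chunks` are names I've made up for this sketch. `np.array_split` takes care of row counts that don't divide evenly:

```python
import numpy as np

def chunk_gram(x, n_chunks=10):
    """Accumulate np.einsum('ij,ik', x, x) over row chunks to cap peak memory."""
    res = np.zeros((x.shape[1], x.shape[1]))
    for chunk in np.array_split(x, n_chunks):   # handles uneven splits too
        res += np.einsum('ij,ik', chunk, chunk)
    return res

x = np.ones((1000, 3))
assert np.allclose(chunk_gram(x), np.einsum('ij,ik', x, x))
```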
This simple einsum is just as easily solved with dot, and potentially faster:
In [484]: x.T.dot(x)
Out[484]:
array([[17, 22, 27],
       [22, 29, 36],
       [27, 36, 45]])
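A quick sanity check that the two spellings agree. For floats, the BLAS-backed `dot` may differ from `einsum` in the last bits, so `allclose` is the safer comparison there:

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
# Integer case: the two results are exactly equal.
assert np.array_equal(np.einsum('ij,ik', x, x), x.T.dot(x))

# Float case: BLAS accumulation order can differ, so compare with allclose.
xf = np.random.default_rng(0).random((1000, 3))
assert np.allclose(np.einsum('ij,ik', xf, xf), xf.T.dot(xf))
```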
In [486]: x=np.ones((10000000,3))
In [487]: timeit np.einsum('ij,ik',x,x)
426 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [488]: x1=np.ones((50000000,3))
In [490]: timeit np.einsum('ij,ik',x1,x1)
2.14 s ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [493]: %%timeit
     ...: res = np.zeros((3,3))
     ...: x2 = x1.reshape(5,-1,3)
     ...: for i in range(5):
     ...:     x3 = x2[i]
     ...:     res += np.einsum('ij,ik',x3,x3)
2.1 s ± 5.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So even though I'm not hitting a memory limit yet, chunking gives a slight speed improvement (though I'm not far from the memory limit; a few leftover arrays in the IPython history can push me over).
And just for comparison, the `@` operator (equivalent to `dot` for 2d arrays):
In [494]: %%timeit
     ...: res = np.zeros((3,3))
     ...: x2 = x1.reshape(5,-1,3)
     ...: for i in range(5):
     ...:     x3 = x2[i]
     ...:     res += x3.T@x3
537 ms ± 9.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [495]: timeit x1.T@x1
530 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [496]: timeit x.T@x
106 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
At this size chunking does not help when using dot.
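If the array genuinely doesn't fit in RAM, the same chunked accumulation works over a memory-mapped file; the filename, shape, and chunk size below are made up for illustration:

```python
import numpy as np
import tempfile, os

# Create a demo memmap; in practice you would map an existing large file.
fname = os.path.join(tempfile.mkdtemp(), 'big.dat')     # hypothetical file
shape = (100_000, 3)                                    # hypothetical size
mm = np.memmap(fname, dtype='float64', mode='w+', shape=shape)
mm[:] = 1.0

# Chunked accumulation: only one slice of rows is in RAM at a time.
res = np.zeros((3, 3))
for start in range(0, shape[0], 10_000):
    chunk = np.asarray(mm[start:start + 10_000])  # copy one chunk into memory
    res += np.einsum('ij,ik', chunk, chunk)

assert np.allclose(res, shape[0])   # all-ones input: every entry equals N
```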
From the comments: `np.einsum('ij,ki->jk', x, x.T)` won't give a `(3,3)` output — something is not clear there. If the `einsum` is done the wrong way round, you would expect memory problems when the "non-3" dimension is this large.