I have an M x N array 'A' that is to be distributed over 'np' processors using MPI, partitioned in the second dimension (i.e. N is the direction that is scattered). Each processor is initially allocated M x N/np memory via fftw_mpi_local_size_2d (I use this FFTW-MPI function because, according to the FFTW3 manual, the resulting allocation is SIMD-aligned and therefore efficient).
Initialisation:

```fortran
! note: FFTW's Fortran interface takes dimensions in row-major (C) order,
! so the dimension to be split across processes, N, is passed first
alloc_local = fftw_mpi_local_size_2d(N, M, MPI_COMM_WORLD, local_n, local_n_offset)
pointer1 = fftw_alloc_real(alloc_local)
call c_f_pointer(pointer1, A, [M, local_n])
```
At this point, each processor holds a slab of A of size M x local_n, where local_n = N/np.
When taking a Fourier transform A(x,y) -> A(x,ky), y runs vertically downwards in the array A (the first dimension, not the MPI-partitioned axis). In Fourier space I have to store (M+2) x local_n elements: for a 1D real array of length M, the r2c transform (dfftw_execute_dft_r2c from FFTW3) produces M/2+1 complex values, i.e. M+2 real values for even M.
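For concreteness, here is roughly what that column-by-column r2c step looks like (a sketch only, assuming the legacy dfftw interface and even M; plan_r2c, work_in, work_out and the buffer Af are placeholder names, not my actual variables):

```fortran
! sketch: r2c transform of each length-M column of the local slab A;
! assumes 'include "fftw3.f"' for FFTW_ESTIMATE, and M even so that
! the M/2+1 complex outputs occupy exactly M+2 reals
double precision :: work_in(M)
double complex   :: work_out(M/2 + 1)
double precision :: Af(M + 2, local_n)   ! Fourier-space buffer (placeholder)
integer*8 :: plan_r2c
integer   :: j

call dfftw_plan_dft_r2c_1d(plan_r2c, M, work_in, work_out, FFTW_ESTIMATE)
do j = 1, local_n
   work_in = A(:, j)                     ! one column of the local slab
   call dfftw_execute_dft_r2c(plan_r2c, work_in, work_out)
   Af(1:M+1:2, j) = real(work_out)       ! real parts
   Af(2:M+2:2, j) = aimag(work_out)      ! imaginary parts
end do
call dfftw_destroy_plan(plan_r2c)
```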
These Fourier-space operations I can do in scratch matrices on every processor independently.
There is one operation where I have to apply a y Fourier transform and an x Fourier cosine transform consecutively. To parallelise the operations done entirely in Fourier space, I want to gather the y-transformed arrays, each of size (M+2) x local_n, into a bigger (M+2) x N array, transpose it, and scatter it back so that the y direction becomes the partitioned one, i.e. (N x (M+2)) --scatter--> (N x (M+2)/np). But each processor was allocated only M x local_n addresses initially.
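In code, the step I have in mind is something like this sketch (placeholder names Bf, Bfull, Bt, Bslab; assumes np divides both N and M+2 so the counts are uniform, and uses root-based collectives):

```fortran
! gather y-transformed slabs on rank 0, transpose, scatter back
use mpi
double precision :: Bf(M + 2, local_n)     ! local y-transformed slab
double precision :: Bslab(N, (M + 2)/np)   ! transposed slab after scatter
double precision, allocatable :: Bfull(:, :), Bt(:, :)
integer :: rank, ierr

call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
if (rank == 0) then
   allocate(Bfull(M + 2, N), Bt(N, M + 2))
else
   allocate(Bfull(1, 1), Bt(1, 1))         ! dummies; only root uses them
end if

! each rank's columns are contiguous and land in rank order,
! so a plain gather reassembles the (M+2) x N array
call MPI_Gather(Bf, (M + 2)*local_n, MPI_DOUBLE_PRECISION, &
                Bfull, (M + 2)*local_n, MPI_DOUBLE_PRECISION, &
                0, MPI_COMM_WORLD, ierr)
if (rank == 0) Bt = transpose(Bfull)
call MPI_Scatter(Bt, N*(M + 2)/np, MPI_DOUBLE_PRECISION, &
                 Bslab, N*(M + 2)/np, MPI_DOUBLE_PRECISION, &
                 0, MPI_COMM_WORLD, ierr)
```

An MPI_Alltoall-based transpose would avoid staging the full (M+2) x N array on rank 0, but the gather/scatter version shows the size mismatch I am asking about.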
If I have M = N, each slab after the transpose is N x (N+2)/np, i.e. N x local_n plus 2N/np extra elements, still more than was allocated. I could resolve this by increasing the memory allocated to one processor.
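Alternatively (a sketch of my own, not existing code), every rank could simply allocate the larger of the two slab sizes up front, which costs only 2N/np extra reals per rank:

```fortran
! allocate enough for both the M x local_n slab and the
! N x (M+2)/np transposed slab (assumes np divides M+2)
alloc_local = fftw_mpi_local_size_2d(N, M, MPI_COMM_WORLD, local_n, local_n_offset)
alloc_local = max(alloc_local, int(N, C_INTPTR_T)*((M + 2)/np))
pointer1 = fftw_alloc_real(alloc_local)
call c_f_pointer(pointer1, A, [M, local_n])
```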
I don't want to start out with (N+2) x N and (N+2) x local_n arrays everywhere, because that would increase the memory requirement for a lot of arrays, while the gymnastics above has to be done only once per iteration.
