
I am thinking about implementing a wrapper for MPI that imitates OpenMP's way of parallelizing for loops.

  begin_parallel_region( chunk_size=100 , num_proc=10 );

  for( int i=0 ; i<1000 ; i++ )
  {
       //some computation 
  }

  end_parallel_region();

The code above distributes the computation inside the for loop across 10 slave MPI processes. Upon entering the parallel region, the chunk size and the number of slave processes are specified. Upon leaving the parallel region, the MPI processes are synchronized and put idle.

EDITED in response to High Performance Mark.

I have no intention of simulating OpenMP's shared memory model. I propose this because I need it. I am developing a library that is required to build graphs from mathematical functions. In these mathematical functions, for loops like the one below often appear.

 for( int i=0 ; i<n ; i++ )
 {
          s = s + sin(x[i]);
 }

So I want to first be able to distribute the sin(x[i]) computations to the slave processes and, at the end, reduce them into a single variable, just like in OpenMP.

I was wondering if there is such a wrapper out there so that I don't have to reinvent the wheel.

Thanks.

  • Note that the whole "charm" of OpenMP is guided scheduling. This is especially useful if the computations inside the for loop are irregular. Commented Aug 27, 2012 at 15:28
  • The only existing (to my knowledge) not-only-in-the-lab solution is Unified Parallel C, but it is a language extension and might not suit your needs. Commented Aug 28, 2012 at 9:07

3 Answers


There is no such wrapper out there which has escaped from the research labs into widespread use. What you propose is not so much re-inventing the wheel as inventing the flying car.

I can see how you propose to write MPI code which simulates OpenMP's approach to sharing the burden of loops; what is much less clear is how you propose to have MPI simulate OpenMP's shared memory model.

In a simple OpenMP program one might have, as you suggest, 10 threads each perform 10% of the iterations of a large loop, perhaps updating the values of a large (shared) data structure. To simulate that inside your cunning wrapper in MPI you'll either have to (i) persuade one-sided communications to behave like shared memory (this might be doable and will certainly be difficult) or (ii) distribute the data to all processes, have each process independently compute 10% of the results, then broadcast the results all-to-all so that at the end of execution each process has all the data that the others have.

Simulating shared memory computing on distributed memory hardware is a hot topic in parallel computing, always has been, always will be. Google for distributed shared memory computing and join the fun.

EDIT

Well, if you've distributed x across processes then individual processes can compute sin(x[i]) and you can reduce the sum onto one process using MPI_Reduce.

I must be missing something about your requirements because I just can't see why you want to build any superstructure on top of what MPI already provides. Nevertheless, my answer to your original question remains No, there is no such wrapper as you seek and all the rest of my answer is mere commentary.




Yes, you could do this, for specific tasks. But you shouldn't.

Consider how you might implement this; the begin part would distribute the data, and the end part would bring the answer back:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

typedef struct state_t {
    int globaln;
    int localn;
    int *locals;
    int *offsets;
    double *localin;
    double *localout;
    double (*map)(double);
} state;

state *begin_parallel_mapandsum(double *in, int n, double (*map)(double)) {
    state *s = malloc(sizeof(state));
    s->globaln = n;
    s->map = map;

    /* figure out decomposition */

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    s->locals  = malloc(size * sizeof(int));
    s->offsets = malloc(size * sizeof(int));

    s->offsets[0] = 0;

    for (int i=0; i<size; i++) {
        /* (n+i)/size divides n items as evenly as possible: the
           per-rank counts sum to exactly n, with any remainder
           going to the higher-numbered ranks */
        s->locals[i] = (n+i)/size;
        if (i < size-1) s->offsets[i+1] = s->offsets[i] + s->locals[i];
    }

    /* allocate local arrays */
    s->localn   = s->locals[rank];
    s->localin  = malloc(s->localn*sizeof(double));
    s->localout = malloc(s->localn*sizeof(double));


    /* distribute */
    MPI_Scatterv( in, s->locals, s->offsets, MPI_DOUBLE,
                  s->localin, s->locals[rank], MPI_DOUBLE,
                  0, MPI_COMM_WORLD);

    return s;
}

double end_parallel_mapandsum(state **s) {
    double localanswer=0., answer;

    /* sum up local answers */
    for (int i=0; i<((*s)->localn); i++) {
        localanswer += ((*s)->localout)[i];
    }

    /* and get global result.  Everyone gets answer */
    MPI_Allreduce(&localanswer, &answer, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    free( (*s)->localin );
    free( (*s)->localout );
    free( (*s)->locals );
    free( (*s)->offsets );
    free( (*s) );

    return answer;
}


int main(int argc, char **argv) {
    int rank;
    double *inputs = NULL;   /* only allocated on rank 0 */
    double result;
    int n=100;
    const double pi=4.*atan(1.);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        inputs = malloc(n * sizeof(double));
        for (int i=0; i<n; i++) {
            inputs[i] = 2.*pi/n*i;
        }
    }

    state *s=begin_parallel_mapandsum(inputs, n, sin);

    for (int i=0; i<s->localn; i++) {
        s->localout[i] = (s->map)(s->localin[i]);
    }

    result = end_parallel_mapandsum(&s);

    if (rank == 0) {
        printf("Calculated result: %lf\n", result);
        double trueresult = 0.;
        for (int i=0; i<n; i++) trueresult += sin(inputs[i]);
        printf("True result:       %lf\n", trueresult);
        free(inputs);
    }

    MPI_Finalize();

}

That constant distribute/gather is a terrible communications burden to sum up a few numbers, and is antithetical to the entire distributed-memory computing model.

To a first approximation, shared memory approaches - OpenMP, pthreads, IPP, what have you - are about scaling computations faster; about throwing more processors at the same chunk of memory. On the other hand, distributed-memory computing is about scaling a computation bigger; about using more resources, particularly memory, than can be found on a single computer. The big win of using MPI is when you're dealing with problem sets which can't fit on any one node's memory, ever. So when doing distributed-memory computing, you avoid having all the data in any one place.

It's important to keep that basic approach in mind even when you are just using MPI on-node to use all the processors. The above scatter/gather approach will just kill performance. The more idiomatic distributed-memory computing approach is for the logic of the program to already have distributed the data - that is, your begin_parallel_region and end_parallel_region above would have already been built into the code above your loop at the very beginning. Then, every loop is just

 for( int i=0 ; i<localn ; i++ )
 {
      s = s + sin(x[i]);
 }

and when you need to exchange data between tasks (or reduce a result, or what have you) then you call the MPI functions to do those specific tasks.

3 Comments

You put it so much better than I did.
Thanks, but I liked your answer, too; we deal with exactly this sort of question in our training classes pretty regularly so it's something I've tried to explain before.
Many Thanks for your brilliant illustration.

Is MPI a must, or are you just trying to run your OpenMP-like code on a cluster? In the latter case, I suggest you take a look at Intel's Cluster OpenMP:

http://www.hpcwire.com/hpcwire/2006-05-19/openmp_on_clusters-1.html

