
I tried to parallelize the nested loop below using OpenMP, but I am not sure if this is the correct way to do it. Here is the part of the code with the nested loops; it is just generic code. With noofrecords set to 50k, it takes a lot of time even after parallelization. Can someone suggest a better way to parallelize the code? I am only parallelizing the outer loop in the code below.

int ctd=0;
#pragma omp parallel for default(none), private(j,k,l), shared(A,B,C,D,ctd)
for(int j=0; j<noofrecords; j++)
{
    for(int k=0; k<noofrecords; k++)
    {
        for(int l=0; l<noofrecords; l++)
        {
            if(condition)
            {
                D[ctd].a1 = A[i].a1;
                ctd++;
            }
        }
    }
}
  • Be aware that your code is wrong - you cannot just increment a shared index like ctd. See stackoverflow.com/questions/43057426/… Commented Nov 7, 2018 at 15:30
  • Zulan is correct. Because ctd is a shared variable, your algorithm is fundamentally sequential. There's not going to be any way to parallelize this effectively if condition holds true many times. Moreover, if you later need to know exactly which index of D corresponded to which record, then there would be no way to parallelize this algorithm at all. Commented Nov 7, 2018 at 21:18
  • @NoseKnowsAll The condition I have is used to join the three tables A, B, and C on their join attributes and put the result in table D. The counter gives me the number of matched records in the resulting table. That's why I feel the counter should be shared. Please let me know your comments. Commented Nov 9, 2018 at 21:09
  • @Nancy Ah, well in that case you can rewrite the entire thing by keeping separate D arrays for each processor. That way, the work in this for loop is embarrassingly parallel. Then at the end, you can sequentially merge the arrays into one large D array. Commented Nov 10, 2018 at 3:31
  • @NoseKnowsAll I have one more question. I am running my code on a cluster, where it runs fine, but on my local machine it gives a memory error. Since it runs without any memory issue on the cluster, do I need to worry about why it is not running on my local machine? Commented Nov 10, 2018 at 6:44

2 Answers


You can use the collapse clause. In your case you have 3 perfectly nested for loops, so it would be something like: #pragma omp parallel for default(none), private(j,k,l), shared(A,B,C,D,ctd) collapse(3)

It will work if the loops are perfectly nested and all the work is in the inner-most loop (which is the case in the code you posted). If noofrecords is much bigger than your maximum thread count, the extra speedup won't be impressive. If it's slow even in parallel, it probably means the bottleneck is not your processing power (more likely the RAM is already working at 100%).

Also, I'm not so sure you really want that private(j,k,l)...



  1. Create a temporary array a1 with the element type of D.a1 and a number of elements equal to the maximum value expected for ctd.
  2. Create a private temporary array a2, of the same type as a1, for each thread.
  3. Fill a2 in parallel, using a private counter ctd2 to track the size of a2.
  4. Copy each thread's a2 into a1 in thread order, adding ctd2 to ctd as you go.
  5. Write to D.a1 in parallel from a1.

Something like this

// Requires <stdlib.h>, <string.h>, and <omp.h>
int ctd = 0;
double *a1 = malloc(sizeof *a1 * N);                       // step 1
#pragma omp parallel
{
  int ctd2 = 0;                      // per-thread match count
  double *a2 = malloc(sizeof *a2 * N);                     // step 2

  #pragma omp for nowait
  for(int j=0; j<noofrecords; j++)
  for(int k=0; k<noofrecords; k++)
  for(int l=0; l<noofrecords; l++)
    if(condition) a2[ctd2++] = A[i].a1;                    // step 3

  // Each thread appends its own buffer to a1, serialized in thread
  // order by the ordered clause (schedule(static) maps iteration t
  // to thread t, so every thread merges its own a2).
  #pragma omp for schedule(static) ordered
  for(int t=0; t<omp_get_num_threads(); t++)
    #pragma omp ordered
    { memcpy(&a1[ctd], a2, sizeof *a1 * ctd2); ctd += ctd2; }  // step 4

  #pragma omp for
  for(int j=0; j<ctd; j++) D[j].a1 = a1[j];                // step 5
  free(a2);
}
free(a1);

N should be the maximum size you expect ctd to reach. One memory inefficiency in this method is that a2 is also allocated with size N, which may be far larger than any single thread needs. A dynamic array such as std::vector in C++ or GArray in GLib would be more memory efficient.

