
I tried to parallelize the nested loop below using OpenMP, but I am not sure if this is the correct way to do it. Here is the part of the code with the nested loops; it is just generic code. With noofrecords set to 50k, it takes a lot of time even after parallelization. Can someone suggest a better way to parallelize the code? I am only parallelizing the outer loop in the code below.

int ctd=0;
#pragma omp parallel for default(none), private(j,k,l), shared(A,B,C,D,ctd)
for(int j=0; j<noofrecords; j++)
{
    for(int k=0; k<noofrecords; k++)
    {
        for(int l=0; l<noofrecords; l++)
        {
            if(condition)
            {
                D[ctd].a1 = A[i].a1;
                ctd++;
            }
        }
    }
}
  • Be aware that your code is wrong - you cannot just increment a shared index like ctd. See stackoverflow.com/questions/43057426/… Commented Nov 7, 2018 at 15:30
  • Zulan is correct. Because ctd is a shared variable, your algorithm is fundamentally sequential. There's not going to be any way to parallelize this effectively if condition holds true many times. Moreover, if you later need to know exactly which index of D corresponded to which record, then there would be no way to parallelize this algorithm at all. Commented Nov 7, 2018 at 21:18
  • @NoseKnowsAll The condition I have is used to join the three tables A, B, and C on their join attributes and put the result in table D. The counter gives me the number of matched records in the resulting table. That's why I feel the counter should be shared. Please let me know your comments. Commented Nov 9, 2018 at 21:09
  • @Nancy Ah, well in that case you can rewrite the entire thing by keeping separate D arrays for each processor. That way, the work in this for loop is embarrassingly parallel. Then at the end, you can sequentially merge the arrays into one large D array. Commented Nov 10, 2018 at 3:31
  • @NoseKnowsAll I have one more question. I am running my code on a cluster, where it runs fine, but on my local machine it gives a memory error. Since it runs without any memory issue on the cluster, do I need to worry about why it is not running on my local machine? Commented Nov 10, 2018 at 6:44

2 Answers


You can use the collapse clause. In your case you have 3 perfectly nested for loops, so it would be something like: #pragma omp parallel for default(none), private(j,k,l), shared(A,B,C,D,ctd) collapse(3)

It will work if the loops are perfectly nested and all the work is in the inner-most loop (which is the case in the code you posted). If noofrecords is much bigger than your maximum thread count, the extra speedup won't be impressive. If it's slow even in parallel, it probably means the bottleneck is not your processing power (more likely the RAM is already working at 100%).

Also, I'm not so sure you really want that private(j,k,l)...



  1. Create a temporary array a1 with the element type of D.a1 and a number of elements equal to the maximum value expected for ctd.
  2. Create a private temporary array a2, of the same type as a1, for each thread.
  3. Fill a2 in parallel, using a private counter ctd2 to track the size of a2.
  4. Copy each thread's a2 into a1 in thread order, adding ctd2 to ctd as you go.
  5. Write to D.a1 in parallel from a1.

Something like this

// Requires <stdlib.h>, <string.h>, and <omp.h>
int ctd = 0;
double *a1 = malloc(sizeof *a1 * N);                       // step 1
#pragma omp parallel
{
  int ctd2 = 0;                      // per-thread match count
  double *a2 = malloc(sizeof *a2 * N);                     // step 2

  #pragma omp for nowait
  for(int j=0; j<noofrecords; j++)
  for(int k=0; k<noofrecords; k++)
  for(int l=0; l<noofrecords; l++)
    if(condition) a2[ctd2++] = A[i].a1;                    // step 3

  // Each thread appends its own buffer to a1, serialized in thread
  // order by the ordered clause (schedule(static) maps iteration t
  // to thread t, so every thread merges its own a2).
  #pragma omp for schedule(static) ordered
  for(int t=0; t<omp_get_num_threads(); t++)
    #pragma omp ordered
    { memcpy(&a1[ctd], a2, sizeof *a1 * ctd2); ctd += ctd2; }  // step 4

  #pragma omp for
  for(int j=0; j<ctd; j++) D[j].a1 = a1[j];                // step 5
  free(a2);
}
free(a1);

N should be the maximum size you expect ctd to reach. One memory inefficiency in this method is that a2 is also allocated with size N, which may be far larger than any single thread needs. A dynamic array such as std::vector in C++ or GArray in GLib would be more memory efficient.

