I'd like to run something like the following:

for (int index = 0; index < num; index++)

I'd want to run the for loop with four threads, with the threads executing iterations in the order 0, 1, 2, 3, 4, 5, 6, 7, 8, etc. That is, for the threads to be working on index = n, (n+1), (n+2), (n+3) (in any particular ordering, but always in this pattern), I want iterations index = 0, 1, 2, ..., (n-1) to already be finished. Is there a way to do this? ordered doesn't really work here, as making the body an ordered section would basically remove all parallelism for me, and scheduling doesn't seem to work either, because I don't want a thread working on a contiguous block of indices k through k + num/4. Thanks for any help!

3 Comments
  • Unclear. You say "in any particular ordering but always in this pattern", but twice you've given a specific ordering. What's the pattern? What is n? And most importantly: why? Commented Jun 1, 2022 at 18:43
  • Please provide an example of the actual scheduling you want on 4 threads. It looks like you want the scheduling to be exactly 0,1,2,3..., but multithreading prevents that. You can schedule the loop so the threads operate on close values, but you cannot guarantee a specific order due to parallelism: if one thread runs faster, the order will be broken. Synchronization and ordering often mean less parallelism (if any...). Commented Jun 1, 2022 at 19:36
  • Sorry that it's not clear. I realized that I didn't convey my idea really well. I want my loop to run in order as much as possible. So thread 0 will work on index = 0, thread 1 will work on index = 1, thread 2 on index = 2, and thread 3 on index = 3. Then whichever finishes first will start on index = 4, and whichever finishes next will work on index = 5. By n I meant some arbitrary value of index. Commented Jun 1, 2022 at 20:40

2 Answers

You can do this not with a parallel for loop, but with a parallel region that manages its own loop, plus a barrier to make sure all running threads have reached the same point before any can continue. Example:

#include <stdatomic.h>
#include <stdio.h>
#include <omp.h>

int main()
{
  atomic_int chunk = 0;
  int num = 12;
  int nthreads = 4;
  
  omp_set_num_threads(nthreads);
  
#pragma omp parallel shared(chunk, num, nthreads)
  {
    /* Each thread atomically claims the next unclaimed index. */
    for (int index; (index = atomic_fetch_add(&chunk, 1)) < num; ) {
      printf("In index %d\n", index);
      fflush(stdout);
      /* Wait until all nthreads indices in this group have finished. */
#pragma omp barrier

      // For illustrative purposes only; not needed in real code
#pragma omp single
      {
        puts("After barrier");
        fflush(stdout);
      }
    }
  }

  puts("Done");
  return 0;
}

One possible output:

$ gcc -std=c11 -O -fopenmp -Wall -Wextra demo.c
$ ./a.out
In index 2
In index 3
In index 1
In index 0
After barrier
In index 4
In index 6
In index 5
In index 7
After barrier
In index 10
In index 9
In index 8
In index 11
After barrier
Done

8 Comments

That's what I'd like to do, and I've tried it before and again just now. But I got an error that said "barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region". Do you know why that is?
@strugglingdevver Yeah, I realized that after hitting submit. See new version.
The code is very synchronous. Barriers are generally quite slow and known not to scale (though 4 threads is OK). They prevent threads from doing useful work when there is load imbalance. Additionally, omp single has an implicit barrier, which means 2 barriers are performed per iteration.
@JérômeRichard Well, yes. OP seems to want something that requires a lot of waiting for other tasks to catch up instead of making full use of parallelism.
@JérômeRichard But I just realized there's a way to avoid the omp single.

I'm not sure I understand your request correctly. If I try to summarize my interpretation, it would be something like: "I want 4 threads sharing the iterations of a loop, with the 4 threads always running on at most 4 consecutive iterations of the loop".

If that's what you want, what about something like this:

int nths = 4;
#pragma omp parallel num_threads( nths )
for( int index_outer = 0; index_outer < num; index_outer += nths ) {
    // min( index_outer + nths, num ), spelled out since C has no standard min()
    int end = index_outer + nths < num ? index_outer + nths : num;
    #pragma omp for
    for( int index = index_outer; index < end; index++ ) {
        // the loop body just as before
    } // the implicit barrier at the end of "omp for" synchronizes the threads here
}

2 Comments

You understood it perfectly. Unfortunately, I'm looking for something a bit cleaner. I have this kind of solution already; I'm just looking for something with less overhead (although I know that may be impossible). I'm working on an assignment to speed up a certain code by 3.3x, and currently I'm stuck at 2.5x and looking for small optimizations.
@strugglingdevver Have you measured the overhead?
