I'd like to run something like the following:

for (int index = 0; index < num; index++)

I'd want to run the for loop with four threads, with the threads executing iterations in the order 0, 1, 2, 3, 4, 5, 6, 7, 8, etc. That is, for the threads to be working on index = n, (n+1), (n+2), (n+3) (in any particular ordering, but always in this pattern), I want iterations index = 0, 1, 2, ..., (n-1) to already be finished. Is there a way to do this? ordered doesn't really work here, as making the body an ordered section would basically remove all parallelism for me, and scheduling doesn't seem to work either, because I don't want a thread working on a contiguous block of indices k through k + num/4. Thanks for any help!

3 Comments
  • Unclear. You say "in any particular ordering but always in this pattern", but twice you've given a specific ordering. What's the pattern? What is n? And most importantly: why? Commented Jun 1, 2022 at 18:43
  • Please provide an example of the actual scheduling you want on 4 threads. It looks like you want the scheduling to be exactly 0,1,2,3..., but multithreading prevents that. You can schedule the loop so the threads operate on close values, but you cannot guarantee a specific order due to parallelism: if one thread runs faster, the order will be broken. Synchronization and ordering often mean less parallelism (if any...). Commented Jun 1, 2022 at 19:36
  • Sorry that it's not clear. I realized that I didn't convey my idea really well. I want my loop to run in order as much as possible. So thread 0 will work on index = 0, thread 1 will work on index = 1, thread 2 on index = 2, and thread 3 on index = 3. Then whichever finishes first will start on index = 4, and whichever finishes next will work on index = 5. By n I meant some arbitrary value of index. Commented Jun 1, 2022 at 20:40

2 Answers

You can do this not with a parallel for loop, but with a parallel region that manages its own loop, plus a barrier to make sure all running threads have reached the same point before any can continue. Example:

#include <stdatomic.h>
#include <stdio.h>
#include <omp.h>

int main()
{
  atomic_int chunk = 0;
  int num = 12;
  int nthreads = 4;
  
  omp_set_num_threads(nthreads);
  
#pragma omp parallel shared(chunk, num, nthreads)
  {
    /* Each thread atomically claims the next unclaimed index. */
    for (int index; (index = atomic_fetch_add(&chunk, 1)) < num; ) {
      printf("In index %d\n", index);
      fflush(stdout);
      /* Wait until all nthreads indices in this group have finished. */
#pragma omp barrier

      // For illustrative purposes only; not needed in real code
#pragma omp single
      {
        puts("After barrier");
        fflush(stdout);
      }
    }
  }

  puts("Done");
  return 0;
}

One possible output:

$ gcc -std=c11 -O -fopenmp -Wall -Wextra demo.c
$ ./a.out
In index 2
In index 3
In index 1
In index 0
After barrier
In index 4
In index 6
In index 5
In index 7
After barrier
In index 10
In index 9
In index 8
In index 11
After barrier
Done

8 Comments

That's what I'd like to do, and I've tried it before and again just now. But I got an error that said "barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region". Do you know why that is?
@strugglingdevver Yeah, I realized that after hitting submit. See new version.
The code is very synchronous. Barriers are generally quite slow and known not to scale (though 4 threads is OK). They prevent threads from doing useful work when there is load imbalance. Additionally, omp single has an implicit barrier, which means 2 barriers are performed per iteration.
@JérômeRichard Well, yes. OP seems to want something that requires a lot of waiting for other tasks to catch up instead of making full use of parallelism.
@JérômeRichard But I just realized there's a way to avoid the omp single.

I'm not sure I understand your request correctly. If I try to summarize my interpretation, it would be something like: "I want 4 threads sharing the iterations of a loop, with the 4 threads always running on at most 4 consecutive iterations of the loop".

If that's what you want, what about something like this:

int nths = 4;
#pragma omp parallel num_threads( nths )
for( int index_outer = 0; index_outer < num; index_outer += nths ) {
    // min( index_outer + nths, num ), spelled out since C has no standard min()
    int end = index_outer + nths < num ? index_outer + nths : num;
    #pragma omp for
    for( int index = index_outer; index < end; index++ ) {
        // the loop body just as before
    } // the implicit barrier at the end of "omp for" synchronizes the threads here
}

2 Comments

You understood it perfectly. Unfortunately, I'm looking for something a bit cleaner. I have this kind of solution already; I'm just looking for something with less overhead (although I know that may be impossible). I'm working on an assignment to speed up a certain code by 3.3x, and currently I'm stuck at 2.5x and looking for small optimizations.
@strugglingdevver Have you measured the overhead?
