
I am running the following loop using, say, 8 OpenMP threads:

float* data;
int n;

#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data, n)
for ( int i = 0; i < n; ++i )
{
    // do something with data[i]
}

Due to NUMA, I'd like to run first half of the loop (i = 0, ..., n/2-1) with threads 0,1,2,3 and second half (i = n/2, ..., n-1) with threads 4,5,6,7.

Essentially, I want to run two loops in parallel, each loop using a separate group of OpenMP threads.

How do I achieve this with OpenMP?

Thank you

PS: Ideally, if threads from one group are done with their half of the loop while the other half is still not done, I'd like the threads from the finished group to join the unfinished group and help process the other half of the loop.

I am thinking about something like below, but I wonder if I can do this with OpenMP and no extra book-keeping:

int n;
int i0 = 0;
int i1 = n / 2;

#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data,n,i0,i1)
for ( int i = 0; i < n; ++i )
{
    int nt = omp_get_thread_num();
    int j;
    #pragma omp critical
    {
        if ( nt < 4 ) {
            if ( i0 < n / 2 ) j = i0++; // first 4 threads process the first half
            else              j = i1++; // of the loop unless the first half is finished
        }
        else {
            if ( i1 < n ) j = i1++;  // last 4 threads process the second half
            else          j = i0++;  // of the loop unless the second half is finished
        }
    }

    // do something with data[j]
}
  • Can you explain why you say "Due to NUMA, I'd like to run first half of the loop (i = 0, ..., n/2-1) with threads 0,1,2,3 and second half (i = n/2, ..., n-1) with threads 4,5,6,7."? Commented Jul 25, 2014 at 14:31
  • Because the data is allocated in such a way that the first half of it is close to one socket (where I run threads 0,1,2,3) and the second half is close to the other socket (where I run threads 4,5,6,7) Commented Jul 25, 2014 at 14:33
  • What is your OS and hardware and compiler? Linux? Two sockets Intel Xeon? Gcc? Commented Jul 25, 2014 at 14:34
  • @Zboson RHEL 6.3, 8-socket Xeon CPU E5-4640 (64 cores total). 1 TB memory. Example in the post is simplified. I need more than 2 groups of threads. Compiler: GCC 4.8.3 or latest Intel. Commented Jul 25, 2014 at 14:38
  • Are you sure you want schedule(dynamic,1) or do you want schedule(static)? Commented Jul 25, 2014 at 14:43

1 Answer


Probably the best approach is to use nested parallelism: first over NUMA nodes, then within each node. That way you still get the dynamic-scheduling infrastructure while breaking the data up amongst thread groups:

#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {

    const int ngroups=2;      // one group per NUMA node
    const int npergroup=4;    // threads within each group
    const int ndata = 16;

    omp_set_nested(1);        // allow the inner parallel regions to spawn threads
    #pragma omp parallel for num_threads(ngroups)
    for (int i=0; i<ngroups; i++) {
        // contiguous chunk [start, end) of the data handled by group i
        int start = (ndata*i+(ngroups-1))/ngroups;
        int end   = (ndata*(i+1)+(ngroups-1))/ngroups;

        #pragma omp parallel for num_threads(npergroup) shared(i, start, end) schedule(dynamic,1)
        for (int j=start; j<end; j++) {
            printf("Thread %d from group %d working on data %d\n", omp_get_thread_num(), i, j);
        }
    }

    return 0;
}

Running this gives

$ gcc -fopenmp -o nested nested.c -Wall -O -std=c99
$ ./nested | sort -n -k 9
Thread 0 from group 0 working on data 0
Thread 3 from group 0 working on data 1
Thread 1 from group 0 working on data 2
Thread 2 from group 0 working on data 3
Thread 1 from group 0 working on data 4
Thread 3 from group 0 working on data 5
Thread 3 from group 0 working on data 6
Thread 0 from group 0 working on data 7
Thread 0 from group 1 working on data 8
Thread 3 from group 1 working on data 9
Thread 2 from group 1 working on data 10
Thread 1 from group 1 working on data 11
Thread 0 from group 1 working on data 12
Thread 0 from group 1 working on data 13
Thread 2 from group 1 working on data 14
Thread 0 from group 1 working on data 15

But note that the nested approach may well change the thread assignments over what the one-level threading would be, so you will probably have to play with KMP_AFFINITY or other mechanisms a bit more to get the bindings right again.
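One possible starting point, assuming a compiler with OpenMP 4.0 affinity support (recent Intel, or GCC 4.9 and later), is to spread the outer-level threads across sockets and keep each inner team close to its master; something like the following, though the exact settings depend on your topology:

$ OMP_NESTED=true OMP_PLACES=cores OMP_PROC_BIND=spread,close ./nested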


Comments

That's a clever answer. I have not used omp_set_nested yet.
Thanks - once I finally understood the question, it mapped nicely onto this.
Thanks. I guess you could also use tasks in outer loop. Don't know if it makes a difference. I am also trying to understand omp teams construct (never used them before). Can this feature be used instead of nested parallelism?
Tasks or parallel for at the top level, it doesn't really matter - whatever makes it easier to read or write (a sketch of the task variant follows these comments). The nice thing about the loop is that it generalizes easily to a different number of top-level NUMA nodes. Teams do refer to nested parallelism, although be careful - in OMP 4, teams refers to the accelerator (GPU/Phi) stuff.
@JonathanDursi I just implemented your approach in my code and tested it on a 2-socket (16-core total) machine. With the old code, execution time dropped from 55 seconds to 50 when going from 8 threads to 16. With the NUMA-aware code, the 16-thread test runs in 32 seconds!
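
For reference, here is a minimal sketch of the task-based outer loop discussed in the comments above, assuming the same ngroups/npergroup/ndata as the answer's example; each task simply runs the inner dynamic loop for its own chunk:

#include <omp.h>
#include <stdio.h>

int main(void) {

    const int ngroups=2;      // same constants as in the answer's example
    const int npergroup=4;
    const int ndata = 16;

    omp_set_nested(1);        // allow the inner parallel regions to spawn threads

    #pragma omp parallel num_threads(ngroups)
    #pragma omp single
    for (int i=0; i<ngroups; i++) {
        #pragma omp task firstprivate(i)   // one task per group / NUMA node
        {
            int start = (ndata*i+(ngroups-1))/ngroups;
            int end   = (ndata*(i+1)+(ngroups-1))/ngroups;

            // each task spawns its own inner team for its chunk of the data
            #pragma omp parallel for num_threads(npergroup) schedule(dynamic,1)
            for (int j=start; j<end; j++) {
                printf("Thread %d from group %d working on data %d\n", omp_get_thread_num(), i, j);
            }
        }
    }

    return 0;
}

Either way, the inner parallel for and its schedule(dynamic,1) clause are unchanged; only the outer-level work distribution differs.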
