
I am running the following loop using, say, 8 OpenMP threads:

float* data;
int n;

#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data, n)
for ( int i = 0; i < n; ++i )
{
    // do something with data[i]
}

Due to NUMA, I'd like to run first half of the loop (i = 0, ..., n/2-1) with threads 0,1,2,3 and second half (i = n/2, ..., n-1) with threads 4,5,6,7.

Essentially, I want to run two loops in parallel, each loop using a separate group of OpenMP threads.

How do I achieve this with OpenMP?

Thank you

PS: Ideally, if threads from one group are done with their half of the loop while the other half is still not done, I'd like the threads from the finished group to join the unfinished group and help process the other half of the loop.

I am thinking about something like below, but I wonder if I can do this with OpenMP and no extra book-keeping:

int n;
int i0 = 0;
int i1 = n / 2;

#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data,n,i0,i1)
for ( int i = 0; i < n; ++i )
{
    int nt = omp_get_thread_num();
    int j;
    #pragma omp critical
    {
        if ( nt < 4 ) {
            if ( i0 < n / 2 ) j = i0++; // first 4 threads process the first half
            else              j = i1++; // of the loop unless the first half is finished
        }
        else {
            if ( i1 < n ) j = i1++;  // last 4 threads process the second half
            else          j = i0++;  // of the loop unless the second half is finished
        }
    }

    // do something with data[j]
}
  • Can you explain why you say "Due to NUMA, I'd like to run first half of the loop (i = 0, ..., n/2-1) with threads 0,1,2,3 and second half (i = n/2, ..., n-1) with threads 4,5,6,7."? Commented Jul 25, 2014 at 14:31
  • Because the data is allocated in such a way that the first half of it is close to one socket (where I run threads 0,1,2,3) and the second half is close to the other socket (where I run threads 4,5,6,7) Commented Jul 25, 2014 at 14:33
  • What is your OS and hardware and compiler? Linux? Two sockets Intel Xeon? Gcc? Commented Jul 25, 2014 at 14:34
  • @Zboson RHEL 6.3, 8-socket Xeon CPU E5-4640 (64 cores total). 1 TB memory. Example in the post is simplified. I need more than 2 groups of threads. Compiler: GCC 4.8.3 or latest Intel. Commented Jul 25, 2014 at 14:38
  • Are you sure you want schedule(dynamic,1) or do you want schedule(static)? Commented Jul 25, 2014 at 14:43

1 Answer


Probably the best approach is to use nested parallelism: first over NUMA nodes, then within each node. That way you still get the dynamic-scheduling infrastructure while breaking the data up amongst thread groups:

#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {

    const int ngroups=2;      // one group per NUMA node
    const int npergroup=4;    // threads within each group
    const int ndata = 16;

    omp_set_nested(1);        // allow the inner parallel regions to spawn threads
    #pragma omp parallel for num_threads(ngroups)
    for (int i=0; i<ngroups; i++) {
        // contiguous chunk [start, end) of the data handled by group i
        int start = (ndata*i+(ngroups-1))/ngroups;
        int end   = (ndata*(i+1)+(ngroups-1))/ngroups;

        #pragma omp parallel for num_threads(npergroup) shared(i, start, end) schedule(dynamic,1)
        for (int j=start; j<end; j++) {
            printf("Thread %d from group %d working on data %d\n", omp_get_thread_num(), i, j);
        }
    }

    return 0;
}

Running this gives

$ gcc -fopenmp -o nested nested.c -Wall -O -std=c99
$ ./nested | sort -n -k 9
Thread 0 from group 0 working on data 0
Thread 3 from group 0 working on data 1
Thread 1 from group 0 working on data 2
Thread 2 from group 0 working on data 3
Thread 1 from group 0 working on data 4
Thread 3 from group 0 working on data 5
Thread 3 from group 0 working on data 6
Thread 0 from group 0 working on data 7
Thread 0 from group 1 working on data 8
Thread 3 from group 1 working on data 9
Thread 2 from group 1 working on data 10
Thread 1 from group 1 working on data 11
Thread 0 from group 1 working on data 12
Thread 0 from group 1 working on data 13
Thread 2 from group 1 working on data 14
Thread 0 from group 1 working on data 15

But note that the nested approach may well change the thread assignments over what the one-level threading would be, so you will probably have to play with KMP_AFFINITY or other mechanisms a bit more to get the bindings right again.
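One possible starting point, assuming a compiler with OpenMP 4.0 affinity support (recent Intel, or GCC 4.9 and later), is to spread the outer-level threads across sockets and keep each inner team close to its master; something like the following, though the exact settings depend on your topology:

$ OMP_NESTED=true OMP_PLACES=cores OMP_PROC_BIND=spread,close ./nested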


Comments

That's a clever answer. I have not used omp_set_nested yet.
Thanks - once I finally understood the question, it mapped nicely onto this.
Thanks. I guess you could also use tasks in outer loop. Don't know if it makes a difference. I am also trying to understand omp teams construct (never used them before). Can this feature be used instead of nested parallelism?
Tasks or parallel for at the top level, it doesn't really matter - whatever makes it easier to read or write (a sketch of the task variant follows these comments). The nice thing about the loop is that it generalizes easily to a different number of top-level NUMA nodes. Teams do refer to nested parallelism, although be careful - in OMP 4, teams refers to the accelerator (GPU/Phi) stuff.
@JonathanDursi I just implemented your approach in my code and tested it on a 2-socket (16-core total) machine. With the old code, execution time dropped from 55 seconds to 50 when going from 8 threads to 16. With the NUMA-aware code, the 16-thread test runs in 32 seconds!
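
For reference, here is a minimal sketch of the task-based outer loop discussed in the comments above, assuming the same ngroups/npergroup/ndata as the answer's example; each task simply runs the inner dynamic loop for its own chunk:

#include <omp.h>
#include <stdio.h>

int main(void) {

    const int ngroups=2;      // same constants as in the answer's example
    const int npergroup=4;
    const int ndata = 16;

    omp_set_nested(1);        // allow the inner parallel regions to spawn threads

    #pragma omp parallel num_threads(ngroups)
    #pragma omp single
    for (int i=0; i<ngroups; i++) {
        #pragma omp task firstprivate(i)   // one task per group / NUMA node
        {
            int start = (ndata*i+(ngroups-1))/ngroups;
            int end   = (ndata*(i+1)+(ngroups-1))/ngroups;

            // each task spawns its own inner team for its chunk of the data
            #pragma omp parallel for num_threads(npergroup) schedule(dynamic,1)
            for (int j=start; j<end; j++) {
                printf("Thread %d from group %d working on data %d\n", omp_get_thread_num(), i, j);
            }
        }
    }

    return 0;
}

Either way, the inner parallel for and its schedule(dynamic,1) clause are unchanged; only the outer-level work distribution differs.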
