How to parallelize a loop?

Question

I'm using OpenMP on C++ and I want to parallelize very simple loop. But I can't do it correctly. All time I get wrong result.

for(i=2;i<N;i++)
    for(j=2;j<N;j++)
         A[i,j] =A[i-2,j] +A[i,j-2];

Code:

int const N = 10;
int arr[N][N];

#pragma omp parallel for
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        arr[i][j] = 1;

#pragma omp parallel for 
for (int i = 2; i < N; i++)
    for (int j = 2; j < N; j++)
    {
        arr[i][j] = arr[i-2][j] +arr[i][j-2];
    }

for (int i = 0; i < N; i++)
{
    for (int j = 0; j < N; j++)
        printf_s("%d     ",arr[i][j]);
    printf("\n");
}

Do you have any suggestions how I can do it? Thank you!

Do you really mean i<N in your inner loop termination condition? — Vicky
– Vicky, Commented Sep 5, 2013 at 10:28
Considering you're modifying the container you're reading from, you have to be very careful about the structure of your parallel operations compared to a serial operation. I'd read up on a good book on multithreading in C++, this is rather too broad for a general SO question. — The Forest And The Trees
– The Forest And The Trees, Commented Sep 5, 2013 at 10:30
@Vicky, Sorry I have mistake in inner loop. Of course, it must be (j < N) — nonameg
– nonameg, Commented Sep 5, 2013 at 10:34

alexbuisson · Accepted Answer · 2013-09-05 10:53:29Z

3

serial and parallel run will give different. result because in

#pragma omp parallel for 
for (int i = 2; i < N; i++)
    for (int j = 2; j < N; j++)
    {
        arr[i][j] = arr[i-2][j] +arr[i][j-2];
    }
    .....

you update arr[i]. so you change data used by the other thread. it will lead to a read over write data race!

answered Sep 5, 2013 at 10:53

alexbuisson

8,6073 gold badges36 silver badges46 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

nonameg Over a year ago

But how can I fix it? I haven't ideas.

High Performance Mark · Accepted Answer · 2013-09-05 11:22:47Z

3

This

#pragma omp parallel for 
for (int i = 2; i < N; i++)
    for (int j = 2; j < N; j++)
    {
        arr[i][j] = arr[i-2][j] +arr[i][j-2];
    }

is always going to be a source of grief and unpredictable output. The OpenMP run time is going to hand each thread a range of values for i and leave them to it. There will be no determinism in the relative order in which threads update arr. For example, while thread 1 is updating elements with i = 2,3,4,5,...,100 (or whatever) and thread 2 is updating elements with i = 102,103,104,...,200 the program does not determine whether thread 1 updates arr[i,:] = 100 before or after thread 2 wants to use the updated values in arr. You have written a code with a classic data race.

You have a number of options to fix this:

You could tie yourself in knots trying to ensure that the threads update arr in the right (ie sequential) order. The end result would be an OpenMP program that runs more slowly than the sequential program. DO NOT TAKE THIS OPTION.

You could make 2 copies of arr and always update from one to the other, then from the other to the one. Something like (very pseudo-code)

for ...
{
    old = 0
    new = 1

    arr[i][j][new] = arr[i-2][j][old] +arr[i][j-2][old];

    old = 1
    new = 0
}

Of course, this second approach trades space for time but that's often a reasonable trade-off.

You may find that adding an extra plane to arr doesn't immediately speed things up because it wrecks the spatial locality of values pulled into cache. Experiment a bit with this, possibly make [old] the first index element rather than the last.

Since updating each element in the array depends on the values found in elements 2 rows/columns away you're effectively splitting the array up like a chess-board, into white and black elements. You could use 2 threads, one on each 'colour', without the threads racing for access to the same data. Again, though, the disruption of spatial locality in the cache might have a bad impact on speed.

If any other options occur to me I'll edit them in.

edited Sep 5, 2013 at 11:22

answered Sep 5, 2013 at 11:02

High Performance Mark

78.7k7 gold badges109 silver badges168 bronze badges

3 Comments

nonameg Over a year ago

Thank you, I'll try to do this things

alexbuisson Over a year ago

As the explanation was promising, I tried to implement it, but I fail, @High Performance Mark can you show how the old/new logic have to be use with OpenMP, is it shared data ? doesn't means that they need to be swap in a critical section ?

High Performance Mark Over a year ago

Well the compiler will make i private if it's the iteration variable on the parallelised loop, then N should be shared and j private. arr will also be shared but new and old should be private so that each thread can proceed independently of the others.

Arch D. Robison · Accepted Answer · 2013-09-05 22:48:22Z

To parallelize the loop nest in the question is tricky, but doable. Lamport's paper "The Parallel Execution of DO Loops" covers the technique. Basically you have to rotate your (i,j) coordinates by 45 degrees into a new coordinate system (k,l), where k=i+j and l=i-j.

Though to actually get speedup, the iterations likely have to be grouped into tiles, which makes the code even uglier (four nested loops).

A completely different approach is to solve the problem recursively, using OpenMP tasking. The recursion is:

if( too small to be worth parallelizing ) {
    do serially
} else {
    // Recursively:
    Do upper left quadrant
    Do lower left and upper right quadrants in parallel
    Do lower right quadrant
}

As a practical matter, the ratio of arithmetic operations to memory accesses is so low that it is going to be difficult to get speedup out of the example.

zam · Accepted Answer · 2013-09-06 19:45:09Z

If you ask about parallelism in general, then one more possible answer is vectorization. You could achieve some relatively poor vector parallelizm (something like 2x speedup or so) without changing the data structure and codebase. This is possible using OpenMP4.0 or CilkPlus pragma simd or similar (with safelen/vectorlength(2))

Well, you really have data dependence (both inner and outer loops), but it belongs to «WAR»[ (write after read) dependencies sub-category, which is blocker for using «omp parallel for» «as is» but not necessarily a problem for «pragma omp simd» loops.

To make this working you will need x86 compilers supporting pragma simd either via OpenMP4 or via CilkPlus (very recent gcc or Intel compiler).

Collectives™ on Stack Overflow

How to parallelize a loop?

4 Answers 4

1 Comment

3 Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

1 Comment

3 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related