nested loops, inner loop parallelization, reusing threads

Question

Disclaimer: following example is just an dummy example to quickly understand the problem. If you are thinking about real world problem, think anything dynamic programming.

The problem: We have an n*m matrix, and we want to copy elements from previous row as in the following code:

for (i = 1; i < n; i++)
    for (j = 0; j < m; j++)
        x[i][j] = x[i-1][j];

Approach: Outer loop iterations have to be executed in order, they would be executed sequentially. Inner loop can be parallelized. We would want to minimize overhead of creating and killing threads, so we would want to create team of threads just once, however, this seems like an impossible task in OpenMP.

#pragma omp parallel private(j)
{
   for (i = 1; i < n; i++)
   {   
      #pragma omp for scheduled(dynamic)
      for (j = 0; j < m; j++)
         x[i][j] = x[i-1][j];
   }
}

When we apply ordered option on the outer loop, the code will be executed sequential way, so there will be no performance gain. I am looking to solution for the scenario above, even if I had to use some workaround.

I am adding my actual code. This is is actually slower than seq. version. Please review:

/* load input */
for (i = 1; i <= n; i++)
    scanf ("%d %d", &in[i][W], &in[i][V]);

/* init */
for (i = 0; i <= wc; i++)
    a[0][i] = 0;

/* compute */
#pragma omp parallel private(i,w)
{
    for(i = 1; i <= n; ++i) // 1 000 000
    {
        j=i%2;
        jn = j == 1 ? 0 : 1;

        #pragma omp for
        for(w = 0; w <= in[i][W]; w++) // 1000
            a[j][w] = a[jn][w];

        #pragma omp for
        for(w = in[i][W]+1; w <= wc; w++) // 350 000
            a[j][w] = max(a[jn][w], in[i][V] + a[jn][w-in[i][W]]);
    }
}

As for measuring, I am using something like this:

double t;
t = omp_get_wtime();
// ...
t = omp_get_wtime() - t;

If all you're doing is copying, it's not clear you'll get a great deal of benefit from parallelization, as you'll be limited by memory bandwidth. — Oliver Charlesworth
– Oliver Charlesworth, Commented Dec 7, 2014 at 12:06
clearly, it's just an example. Think of dynamic programming... — notnull
– notnull, Commented Dec 7, 2014 at 12:08
How much does the overhead contributes to the total time? In other words did you measure before optimizing? — 2501
– 2501, Commented Dec 7, 2014 at 12:12
@2501 It seems the overhead makes quiet enough, since the performance gain of parallel for used directly on the inner loop gives weak results compared to theory. — notnull
– notnull, Commented Dec 7, 2014 at 12:19
@notnull Are you only asking about performance and not about correctness? — levzettelin
– levzettelin, Commented Dec 7, 2014 at 21:17

notnull · Accepted Answer · 2014-12-08 12:40:35Z

To sum up the parallelization in OpenMP for this particular case: It is not worth it.

Why? Operations in the inner loops are simple. Code was compiled with -O3, so max() call was probably substituted with the code of function body. Overhead in implicit barrier is probably high enough, to compensate the performance gain, and overall overhead is high enough to make the parallel code even slower than the sequential code was. I also found out, there is no real performance gain in such construct:

#pragma omp parallel private(i,j)
{ 
   for (i = 1; i < n; i++)
   {   
      #pragma omp for
      for (j = 0; j < m; j++)
         x[i][j] = x[i-1][j];
   }
}

because it's performance is similar to this one

for (i = 1; i < n; i++)
{   
   #pragma omp parallel for private(j)
   for (j = 0; j < m; j++)
      x[i][j] = x[i-1][j];
}

thanks to built-in thread reusing in GCC libgomp, according to this article: http://bisqwit.iki.fi/story/howto/openmp/

Since the outer loop cannot be paralellized (without ordered option) it looks there is no way to significantly improve performance of the program in question using OpenMP. If someone feels I did something wrong, and it is possible, I'll be glad to see and test the solution.

Collectives™ on Stack Overflow

nested loops, inner loop parallelization, reusing threads

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related