
How can I optimize this for loop:

for (l = 1; l <= loop; l++) {
    for (i = 1; i < n; i++) {
        x[i] = z[i] * (y[i] - x[i - 1]);
    }
}

And how can I parallelize both the original and the optimized version of it with OpenMP?

  • Where is l used in the inner loop? Commented Apr 12, 2018 at 5:34
  • @dvhh: the inner loop must run "loop" times. Commented Apr 12, 2018 at 5:37
  • Possible duplicate of algorithm-to-optimize-nested-loops Commented Apr 12, 2018 at 5:37
  • @ThiruShetty: No, it is different: optimization and parallelization. Commented Apr 12, 2018 at 5:44
  • This can be rewritten as a prefix/cumulative sum. I only know how to scale the prefix sum with the number of memory controllers (NUMA), not the number of cores: en.wikipedia.org/wiki/Prefix_sum#Parallel_algorithm (see the scan sketch after these comments). Commented Apr 13, 2018 at 8:29
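
Following up on the prefix-sum comment above: the inner loop is a first-order linear recurrence, x[i] = a[i]*x[i-1] + b[i] with a[i] = -z[i] and b[i] = z[i]*y[i], and the affine maps v -> a*v + b compose associatively, so the chain can be evaluated as a parallel scan. Below is a minimal, hedged sketch of the blocked two-pass variant with OpenMP; the function name, the chunking and the scratch arrays are illustrative assumptions, not from the question, and whether it beats the plain sequential loop depends on n and on memory bandwidth.

#include <stdlib.h>
#include <omp.h>

/* Computes x[i] = z[i] * (y[i] - x[i-1]) for i = 1 .. n-1, given x[0],
   as a blocked parallel scan over the affine maps v -> -z[i]*v + z[i]*y[i]. */
void recurrence_scan(int n, const double *z, const double *y, double *x)
{
    int Tmax = omp_get_max_threads();
    double *A = malloc((size_t)Tmax * sizeof *A);  /* composed slope, later chunk-start x */
    double *B = malloc((size_t)Tmax * sizeof *B);  /* composed offset per chunk           */

    #pragma omp parallel
    {
        int T  = omp_get_num_threads();            /* actual team size (<= Tmax) */
        int t  = omp_get_thread_num();
        int lo = 1 + (int)((long long)(n - 1) * t / T);
        int hi = 1 + (int)((long long)(n - 1) * (t + 1) / T);
        int i;

        /* pass 1 (parallel): compose this chunk's affine maps into v -> a*v + b */
        double a = 1.0, b = 0.0;
        for (i = lo; i < hi; i++) {
            b = -z[i] * b + z[i] * y[i];
            a = -z[i] * a;
        }
        A[t] = a;
        B[t] = b;

        #pragma omp barrier
        #pragma omp single
        {
            /* short sequential pass: propagate x across the chunk boundaries;
               A[k] is reused to hold the x value entering chunk k            */
            double v = x[0];
            int k;
            for (k = 0; k < T; k++) {
                double next = A[k] * v + B[k];
                A[k] = v;
                v = next;
            }
        }                                          /* implicit barrier here */

        /* pass 2 (parallel): re-run the recurrence with the correct start value */
        double prev = A[t];
        for (i = lo; i < hi; i++) {
            x[i] = z[i] * (y[i] - prev);
            prev = x[i];
        }
    }

    free(A);
    free(B);
}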

3 Answers


Assuming you want to parallelize the inner loop

for ( i = 1; i < n; ++i ) {
    x[i] = z[i] * ( y[i] - x[i - 1] );
}

I would suggest pre-computing the part that does not depend on the previous iteration, which is easier to parallelize.

double preComps [n];
#pragma omp parallel for
for( i = 1; i < n ; ++i ) {
    preComps[i] = z[i] * y[i];
}

// this loop is difficult to parallelize because of the data dependency on what was computed in the previous loop
for( i = 1; i < n ; ++i ) {
    x[i] = preComps[i] - z[i] * x[i - 1];
}
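
For reference, a complete version of this idea might look like the sketch below; the function name compute_x and the caller-provided scratch array are my own assumptions, not from the question. The pre-computation is embarrassingly parallel, while the second loop has to stay sequential because each x[i] needs x[i-1].

#include <omp.h>

void compute_x(int n, const double *z, const double *y, double *x,
               double *preComps /* scratch, at least n elements */)
{
    int i;

    /* independent products: safe to spread over threads */
    #pragma omp parallel for
    for (i = 1; i < n; i++)
        preComps[i] = z[i] * y[i];

    /* the recurrence itself stays sequential: each x[i] depends on x[i-1] */
    for (i = 1; i < n; i++)
        x[i] = preComps[i] - z[i] * x[i - 1];
}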

5 Comments

As far as I can see, you just made it more complex. The critical loop has the same complexity as before, but you added another loop that may be parallelized, resulting in more memory usage and more computation.
The whole process is (very likely to be) memory bound. Even if your pre-computation of some terms in parallel were sped up by its parallelization (which is yet to be proven), the number of memory references in the actual loop is the same as before, so it will be as costly as the initial one. All in all, your method will take more time and more resources than the initial one.
@KamiKaze, in order to save time, you do have to parallelize as much as possible. The approach leads to a win of almost 100% more speed, as seen in my answer. It's more complex, of course... but it saves time (approx. 50%).
@LuisColorado Granted, but only if there is something to parallelize; this does not change the critical path at all, save for a possible MAC operation. See what Gilles said.
The critical path (as shown in my answer) allows for two threads of parallel calculation, so you get a 50% time saving for the whole thing. Not considering that the outer loop can be completely eliminated in almost all cases.

OUTER LOOP (no aliasing assumed, see below for aliasing permitted)

As you make no reference to the outer loop control variable l, and the inner loop has no side effects other than writing x[] from the same inputs every time, the inner loop is idempotent: there is no difference between running it once or many times. So a very good optimisation is to eliminate the outer loop completely :), as the following example shows:

pru.c

#include <stdio.h>
#include <stdlib.h>
#define n 10
#define loop 30

void print(int x[], int y[], int z[])
{
    int i;
    printf("%12s%12s%12s%12s\n", "i", "x[]", "y[]", "z[]");
    for (i = 0; i < n; i++)
        printf("%12d%12d%12d%12d\n", i, x[i], y[i], z[i]);
}

int main()
{
    int x[n], y[n], z[n];
    int i, l;
    for (i = 0; i < n; i++) {
        x[i] = rand();
        y[i] = rand();
        z[i] = rand();
    }

print the arrays at the beginning

    print(x, y, z);

next is the posted loop:

    for (l = 1; l <= loop; l++) {
        printf("iteration %d\n", l);

        for (i = 1; i < n; i++) {
            x[i] = z[i] * (y[i] - x[i - 1]);
        }

and print it

        print(x, y, z);

    }

end of posted loop

}

As you can see, there is no difference in the array contents between passes of the loop. Next is a run of the program to demonstrate:

initial contents:

$ a.out
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1  1969887315   140734212   940422543
           2   202055087   768218108   770072198
           3  1866991770  1647128879    83392682
           4  1421485336   148486083   229615973
           5   127561358   735081006    33063457
           6  1646757679   287085223  1793088605
           7   802182690   382151770  1848710666
           8  1486775472   115658218   394986197
           9   661076908  1786703631   864107022

first iteration:

iteration 1
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1 -1607135687   140734212   940422543
           2  1213242898   768218108   770072198
           3 -1987622590  1647128879    83392682
           4 -1113079323   148486083   229615973
           5  -327431319   735081006    33063457
           6   407021958   287085223  1793088605
           7  1996444744   382151770  1848710666
           8   500660170   115658218   394986197
           9   -84727866  1786703631   864107022
iteration 2
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1 -1607135687   140734212   940422543
           2  1213242898   768218108   770072198
           3 -1987622590  1647128879    83392682
           4 -1113079323   148486083   229615973
           5  -327431319   735081006    33063457
           6   407021958   287085223  1793088605
           7  1996444744   382151770  1848710666
           8   500660170   115658218   394986197
           9   -84727866  1786703631   864107022
iteration 3
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1 -1607135687   140734212   940422543
           2  1213242898   768218108   770072198
           3 -1987622590  1647128879    83392682
           4 -1113079323   148486083   229615973
           5  -327431319   735081006    33063457
           6   407021958   287085223  1793088605
           7  1996444744   382151770  1848710666
           8   500660170   115658218   394986197
           9   -84727866  1786703631   864107022

... and the iterations repeat until

iteration 30
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1 -1607135687   140734212   940422543
           2  1213242898   768218108   770072198
           3 -1987622590  1647128879    83392682
           4 -1113079323   148486083   229615973
           5  -327431319   735081006    33063457
           6   407021958   287085223  1793088605
           7  1996444744   382151770  1848710666
           8   500660170   115658218   394986197
           9   -84727866  1786703631   864107022
$ _

INNER LOOP

If you reorder the inner expression, you can get some benefit in the inner loop as well. The dependency chain is:

x[0]
 \----.
      |
x[1] <+- y[1], z[1]
  \---.
      |
x[2] <+- y[2], z[2]
  .
  .
  .
x[n-2]<+- y[n-2],z[n-2]
   \--.
      |
x[n-1]<+- y[n-1],z[n-1]

If you rearrange the expression as x[i] = z[i]*y[i] - z[i]*x[i-1], you can parallelize all the calculations of z[i]*y[i], and also start the calculation of z[i]*x[i-1] as soon as the value of x[i-1] is available, gaining some time in the calculation of the inner loop.

 thrd[0]   thrd[1]    thrd[2]      ... thrd[j]   ...  thrd[n]
============================================================
z[1]*x[0]  z[1]*y[1]     z[2]*y[2] ... z[j]*y[j] ... z[n-1]*y[n-1]
    |          |             |             |                |        
    \----------+-------.     |             |                |
           ,---'       |     |             |                |
           |           |     |             |                |
           V           V     |             |                |
x[1] = z[1]*y[1] - z[1]*x[0] |             |                |
  |                          |             |                |
  `--------------------.     |             |                |
                       |     |             |                |
           ,-----------+-----'             |                |
           |           |                   |                |
           V           V                   |                |
x[2] = z[2]*y[2] - z[2]*x[1]               |                |
  |                                        |                |
  `--------------------.                   |                |
           ,-----------+-------------------'                |
           |           |                                    |
          ...         ...                                   |
           V           V                                    |
x[j] = z[j]*y[j] - z[j]*x[j-1]                              |
...                                                         |                    
 |                                                          |
 `---------------------------.                              |
                             |                              |
               ,-------------+------------------------------'
               |             |
               V             V
x[n-1] = z[n-1]*y[n-1]-z[n-1]*x[n-2]

This can be calculated with a pool of two threads. Previously you had n-1 products and n-1 subtractions; now you have 2*(n-1) products and n-1 subtractions, calculated in parallel, so in the end you get no savings from this approach (but you do get two threads working; thanks to KamiKaze, who showed me the mistake).

CONSIDERING ALIASING

As can be seen from the previous graph, the calculations of the inner loop depend only on x[0], y[1...n-1] and z[1...n-1], and the chain of cross dependencies starts with the expression x[1] = f(x[0], z[1], y[1]). If we alias x with z or with y, each expression transforms into x[j] = f(x[j-1], x[j], y[j]) or x[j] = f(x[j-1], z[j], x[j]), which in general makes the new value of x[j] depend on the previous value of x[j]. In those cases (x aliased with y or z, or both) the algorithm is not idempotent, and the outer loop cannot be eliminated. If only y is aliased with z, the expression is x[j] = f(x[j-1], y[j]) (or, equivalently, x[j] = f(x[j-1], z[j])), so no dependency on previous x[j] values exists, and the algorithm remains idempotent.

So, in conclusion: if aliasing is allowed between the x vector and either y or z, the outer loop cannot be eliminated and must be kept. If only y and z alias each other, the algorithm is still idempotent and the outer loop is unnecessary.
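
A hedged illustration of the no-aliasing case: with C99 restrict the caller promises that x, y and z never overlap, so the reasoning above applies and a single pass is enough; repeating it (the original outer loop) would recompute exactly the same values. The function name is mine, not from the question.

void compute_x_once(int n,
                    double * restrict x,
                    const double * restrict y,
                    const double * restrict z)
{
    int i;

    /* one pass: with no aliasing, running this again changes nothing */
    for (i = 1; i < n; i++)
        x[i] = z[i] * (y[i] - x[i - 1]);
}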

10 Comments

As implied here, if the outer loop simply repeated the inner loop calculation (it is not clear that this is so), a compiler could cut the outer loop down to a single iteration, but with OpenMP parallelization such an optimization might be prevented.
@tim18, right! You can get even worse resource waste, keeping several processors busy calculating a thing that is idempotent.
Please don't post code with line numbers; it makes it tedious to try your solution.
Your code might produce undefined behaviour: multiplying two integers that are each in the range of int can yield a value greater than INT_MAX.
Your inner loop still has the same critical path as before. You have to wait for the previous iteration for the next calculation. You might have the result of your precalculated multiplication, but you still have to multiply and subtract in the critical path, which yields no benefit. The exception would be a MAC operation (see my answer).

The outer loop is just noise; it does the same thing every time, as Luis stated in his answer. So we don't have to consider it (or we could massively parallelize it, but doing it once would be enough)...

The inner loop depends on the previous iteration (x[i-1]).

A precomputation of z * y, changing the inner loop to x[i] = preComps[i] - z[i] * x[i - 1]; (as proposed by dvhh), might yield a benefit when a multiply-accumulate instruction is used, but this is implementation-specific and I don't know the gain of it.
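
As a hedged illustration of that multiply-accumulate idea, using C99's fma() from <math.h> (the function name is an assumption; whether the call maps to a single FMA instruction, and whether that actually shortens the critical path, is entirely compiler- and target-dependent):

#include <math.h>

/* x[i] = preComps[i] - z[i]*x[i-1], written as a fused multiply-add:
   fma(a, b, c) computes a*b + c with a single rounding step.          */
void compute_x_fma(int n, const double *z, double *x,
                   const double *preComps /* z[i]*y[i], filled beforehand */)
{
    int i;
    for (i = 1; i < n; i++)
        x[i] = fma(-z[i], x[i - 1], preComps[i]);
}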

If there are 0s in z[] then there is a possibility to parallelize, because z[i] = 0 implies x[i] = 0, which lets you split the inner loop at that point:

x[i+1] will then always be equal to z[i+1] * y[i+1], which is known immediately, giving you an entry point for another loop (a sketch follows below).
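
A minimal sketch of that splitting idea, assuming double arrays; the helper name and the segment bookkeeping are mine, it only pays off if z[] actually contains zeros, and each segment is still sequential inside.

#include <stdlib.h>
#include <omp.h>

void compute_x_split(int n, const double *z, const double *y, double *x)
{
    /* split points: index 0 (x[0] is given) plus every i where z[i] == 0,
       because there x[i] = 0 regardless of x[i-1]                          */
    int *start = malloc((size_t)n * sizeof *start);
    int nseg = 0, i, s;

    start[nseg++] = 0;
    for (i = 1; i < n; i++)
        if (z[i] == 0.0) {
            x[i] = 0.0;                 /* known immediately */
            start[nseg++] = i;
        }

    /* every segment begins at an already-known x value, so the segments
       are independent and can be handed to different threads             */
    #pragma omp parallel for schedule(dynamic)
    for (s = 0; s < nseg; s++) {
        int lo = start[s] + 1;
        int hi = (s + 1 < nseg) ? start[s + 1] : n;
        int j;
        for (j = lo; j < hi; j++)
            x[j] = z[j] * (y[j] - x[j - 1]);
    }

    free(start);
}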

3 Comments

Particularly in the original version, a declaration like float *__restrict x to assert no aliasing between x[] and y[] and z[] could help by permitting the loads of y and z to occur in parallel with computation of previous x values. As previously asserted, there is no evident value in OpenMP here.
@tim18, I have to check this, but the calculations in the inner loop depend only on the previous values of array x (y and z are only read), and neither x[0], y[0], nor z[0] is touched by the inner loop... So, probably, even in the case of aliasing, running the inner loop twice has no effect on the output. But, at least for the moment, that's only a hypothesis. I guess that only in the case of array overlapping can you get a difference.
@LuisColorado You are right about the outer loop; I misjudged the impact of x[i-1]. x[0] never changes, so the starting point is the same, and every other value in x is overwritten in order before it is read, so values from the previous pass never feed into the result. (Rewrote the comment as it had a few mistakes.)
