
How can I optimize this for loop:

for (l = 1; l <= loop; l++) {
    for (i = 1; i < n; i++) {
        x[i] = z[i] * (y[i] - x[i - 1]);
    }
}

And how can I parallelize both the original and the optimized version of it with OpenMP?

  • Where is l used in the inner loop? Commented Apr 12, 2018 at 5:34
  • @dvhh: the inner loop must run "loop" times. Commented Apr 12, 2018 at 5:37
  • Possible duplicate of algorithm-to-optimize-nested-loops Commented Apr 12, 2018 at 5:37
  • @ThiruShetty: No, it is different: optimization and parallelization. Commented Apr 12, 2018 at 5:44
  • This can be rewritten as a prefix/cumulative sum. I only know how to scale the prefix sum with the number of memory controllers (NUMA), not the number of cores: en.wikipedia.org/wiki/Prefix_sum#Parallel_algorithm (see the scan sketch after these comments). Commented Apr 13, 2018 at 8:29
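
Following up on the prefix-sum comment above: the inner loop is a first-order linear recurrence, x[i] = a[i]*x[i-1] + b[i] with a[i] = -z[i] and b[i] = z[i]*y[i], and the affine maps v -> a*v + b compose associatively, so the chain can be evaluated as a parallel scan. Below is a minimal, hedged sketch of the blocked two-pass variant with OpenMP; the function name, the chunking and the scratch arrays are illustrative assumptions, not from the question, and whether it beats the plain sequential loop depends on n and on memory bandwidth.

#include <stdlib.h>
#include <omp.h>

/* Computes x[i] = z[i] * (y[i] - x[i-1]) for i = 1 .. n-1, given x[0],
   as a blocked parallel scan over the affine maps v -> -z[i]*v + z[i]*y[i]. */
void recurrence_scan(int n, const double *z, const double *y, double *x)
{
    int Tmax = omp_get_max_threads();
    double *A = malloc((size_t)Tmax * sizeof *A);  /* composed slope, later chunk-start x */
    double *B = malloc((size_t)Tmax * sizeof *B);  /* composed offset per chunk           */

    #pragma omp parallel
    {
        int T  = omp_get_num_threads();            /* actual team size (<= Tmax) */
        int t  = omp_get_thread_num();
        int lo = 1 + (int)((long long)(n - 1) * t / T);
        int hi = 1 + (int)((long long)(n - 1) * (t + 1) / T);
        int i;

        /* pass 1 (parallel): compose this chunk's affine maps into v -> a*v + b */
        double a = 1.0, b = 0.0;
        for (i = lo; i < hi; i++) {
            b = -z[i] * b + z[i] * y[i];
            a = -z[i] * a;
        }
        A[t] = a;
        B[t] = b;

        #pragma omp barrier
        #pragma omp single
        {
            /* short sequential pass: propagate x across the chunk boundaries;
               A[k] is reused to hold the x value entering chunk k            */
            double v = x[0];
            int k;
            for (k = 0; k < T; k++) {
                double next = A[k] * v + B[k];
                A[k] = v;
                v = next;
            }
        }                                          /* implicit barrier here */

        /* pass 2 (parallel): re-run the recurrence with the correct start value */
        double prev = A[t];
        for (i = lo; i < hi; i++) {
            x[i] = z[i] * (y[i] - prev);
            prev = x[i];
        }
    }

    free(A);
    free(B);
}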

3 Answers


Assuming you want to parallelize the inner loop

for ( i = 1; i < n; ++i ) {
    x[i] = z[i] * ( y[i] - x[i - 1] );
}

I would suggest pre-computing the part that does not depend on the previous iteration, which is easier to parallelize.

double preComps [n];
#pragma omp parallel for
for( i = 1; i < n ; ++i ) {
    preComps[i] = z[i] * y[i];
}

// this loop is difficult to parallelize because of the data dependency on what was computed in the previous loop
for( i = 1; i < n ; ++i ) {
    x[i] = preComps[i] - z[i] * x[i - 1];
}
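
For reference, a complete version of this idea might look like the sketch below; the function name compute_x and the caller-provided scratch array are my own assumptions, not from the question. The pre-computation is embarrassingly parallel, while the second loop has to stay sequential because each x[i] needs x[i-1].

#include <omp.h>

void compute_x(int n, const double *z, const double *y, double *x,
               double *preComps /* scratch, at least n elements */)
{
    int i;

    /* independent products: safe to spread over threads */
    #pragma omp parallel for
    for (i = 1; i < n; i++)
        preComps[i] = z[i] * y[i];

    /* the recurrence itself stays sequential: each x[i] depends on x[i-1] */
    for (i = 1; i < n; i++)
        x[i] = preComps[i] - z[i] * x[i - 1];
}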

5 Comments

As far as I can see, you just made it more complex. The critical loop has the same complexity as before, but you added another loop that may be parallelized, resulting in more memory usage and more computation.
The whole process is (very likely to be) memory bound. Even if your pre-computation of some terms in parallel were sped up by its parallelization (which is yet to be proven), the number of memory references in the actual loop is the same as before, so it will be as costly as the initial one. All in all, your method will take more time and more resources than the initial one.
@KamiKaze, in order to save time, you do have to parallelize as much as possible. The approach leads to a win of almost 100% more speed, as seen in my answer. It's more complex, of course... but it saves time (approx. 50%).
@LuisColorado Granted, but only if there is something to parallelize; this does not change the critical path at all, save for a possible MAC operation. See what Gilles said.
The critical path (as shown in my answer) allows for two threads of parallel calculation, so you get a 50% time saving for the whole thing. Not considering that the outer loop can be completely eliminated in almost all cases.

OUTER LOOP (no aliasing assumed, see below for aliasing permitted)

As you make no reference to the outer loop control variable l, and the inner loop has no side effects other than writing x[] from the same inputs every time, the inner loop is idempotent: there is no difference between running it once or many times. So a very good optimisation is to eliminate the outer loop completely :), as the following example shows:

pru.c

#include <stdio.h>
#include <stdlib.h>
#define n 10
#define loop 30

void print(int x[], int y[], int z[])
{
    int i;
    printf("%12s%12s%12s%12s\n", "i", "x[]", "y[]", "z[]");
    for (i = 0; i < n; i++)
        printf("%12d%12d%12d%12d\n", i, x[i], y[i], z[i]);
}

int main()
{
    int x[n], y[n], z[n];
    int i, l;
    for (i = 0; i < n; i++) {
        x[i] = rand();
        y[i] = rand();
        z[i] = rand();
    }

print the arrays at the beginning

    print(x, y, z);

next is the posted loop:

    for (l = 1; l <= loop; l++) {
        printf("iteration %d\n", l);

        for (i = 1; i < n; i++) {
            x[i] = z[i] * (y[i] - x[i - 1]);
        }

and print it

        print(x, y, z);

    }

end of posted loop

}

As you can see, there is no difference in the array contents between passes of the loop. Next is a run of the program to demonstrate:

initial contents:

$ a.out
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1  1969887315   140734212   940422543
           2   202055087   768218108   770072198
           3  1866991770  1647128879    83392682
           4  1421485336   148486083   229615973
           5   127561358   735081006    33063457
           6  1646757679   287085223  1793088605
           7   802182690   382151770  1848710666
           8  1486775472   115658218   394986197
           9   661076908  1786703631   864107022

first iteration:

iteration 1
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1 -1607135687   140734212   940422543
           2  1213242898   768218108   770072198
           3 -1987622590  1647128879    83392682
           4 -1113079323   148486083   229615973
           5  -327431319   735081006    33063457
           6   407021958   287085223  1793088605
           7  1996444744   382151770  1848710666
           8   500660170   115658218   394986197
           9   -84727866  1786703631   864107022
iteration 2
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1 -1607135687   140734212   940422543
           2  1213242898   768218108   770072198
           3 -1987622590  1647128879    83392682
           4 -1113079323   148486083   229615973
           5  -327431319   735081006    33063457
           6   407021958   287085223  1793088605
           7  1996444744   382151770  1848710666
           8   500660170   115658218   394986197
           9   -84727866  1786703631   864107022
iteration 3
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1 -1607135687   140734212   940422543
           2  1213242898   768218108   770072198
           3 -1987622590  1647128879    83392682
           4 -1113079323   148486083   229615973
           5  -327431319   735081006    33063457
           6   407021958   287085223  1793088605
           7  1996444744   382151770  1848710666
           8   500660170   115658218   394986197
           9   -84727866  1786703631   864107022

... and the iterations repeat until

iteration 30
           i         x[]         y[]         z[]
           0       33613   564950497  1097816498
           1 -1607135687   140734212   940422543
           2  1213242898   768218108   770072198
           3 -1987622590  1647128879    83392682
           4 -1113079323   148486083   229615973
           5  -327431319   735081006    33063457
           6   407021958   287085223  1793088605
           7  1996444744   382151770  1848710666
           8   500660170   115658218   394986197
           9   -84727866  1786703631   864107022
$ _

INNER LOOP

If you reorder the inner expression, you can get some benefit in the inner loop as well. The dependency chain is:

x[0]
 \----.
      |
x[1] <+- y[1], z[1]
  \---.
      |
x[2] <+- y[2], z[2]
  .
  .
  .
x[n-2]<+- y[n-2],z[n-2]
   \--.
      |
x[n-1]<+- y[n-1],z[n-1]

If you rearrange the expression as x[i] = z[i]*y[i] - z[i]*x[i-1], you can parallelize all the calculations of z[i]*y[i], and also start the calculation of z[i]*x[i-1] as soon as the value of x[i-1] is available, gaining some time in the calculation of the inner loop.

 thrd[0]   thrd[1]    thrd[2]      ... thrd[j]   ...  thrd[n]
============================================================
z[1]*x[0]  z[1]*y[1]     z[2]*y[2] ... z[j]*y[j] ... z[n-1]*y[n-1]
    |          |             |             |                |        
    \----------+-------.     |             |                |
           ,---'       |     |             |                |
           |           |     |             |                |
           V           V     |             |                |
x[1] = z[1]*y[1] - z[1]*x[0] |             |                |
  |                          |             |                |
  `--------------------.     |             |                |
                       |     |             |                |
           ,-----------+-----'             |                |
           |           |                   |                |
           V           V                   |                |
x[2] = z[2]*y[2] - z[2]*x[1]               |                |
  |                                        |                |
  `--------------------.                   |                |
           ,-----------+-------------------'                |
           |           |                                    |
          ...         ...                                   |
           V           V                                    |
x[j] = z[j]*y[j] - z[j]*x[j-1]                              |
...                                                         |                    
 |                                                          |
 `---------------------------.                              |
                             |                              |
               ,-------------+------------------------------'
               |             |
               V             V
x[n-1] = z[n-1]*y[n-1]-z[n-1]*x[n-2]

This can be calculated with a pool of two threads. Previously you had n-1 products and n-1 subtractions; now you have 2*(n-1) products and n-1 subtractions, calculated in parallel, so in the end you get no savings from this approach (but you do get two threads working; thanks to KamiKaze, who showed me the mistake).

CONSIDERING ALIASING

As can be seen from the previous graph, the calculations of the inner loop depend only on x[0], y[1...n-1] and z[1...n-1], and the chain of cross dependencies starts with the expression x[1] = f(x[0], z[1], y[1]). If we alias x with z or with y, each expression transforms into x[j] = f(x[j-1], x[j], y[j]) or x[j] = f(x[j-1], z[j], x[j]), which in general makes the new value of x[j] depend on the previous value of x[j]. In those cases (x aliased with y or z, or both) the algorithm is not idempotent, and the outer loop cannot be eliminated. If only y is aliased with z, the expression is x[j] = f(x[j-1], y[j]) (or, equivalently, x[j] = f(x[j-1], z[j])), so no dependency on previous x[j] values exists, and the algorithm remains idempotent.

So, in conclusion: if aliasing is allowed between the x vector and either y or z, the outer loop cannot be eliminated and must be kept. If only y and z alias each other, the algorithm is still idempotent and the outer loop is unnecessary.
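
A hedged illustration of the no-aliasing case: with C99 restrict the caller promises that x, y and z never overlap, so the reasoning above applies and a single pass is enough; repeating it (the original outer loop) would recompute exactly the same values. The function name is mine, not from the question.

void compute_x_once(int n,
                    double * restrict x,
                    const double * restrict y,
                    const double * restrict z)
{
    int i;

    /* one pass: with no aliasing, running this again changes nothing */
    for (i = 1; i < n; i++)
        x[i] = z[i] * (y[i] - x[i - 1]);
}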

10 Comments

As implied here, if the outer loop simply repeated the inner loop calculation (it is not clear that this is so), a compiler could cut the outer loop down to a single iteration, but with OpenMP parallelization such an optimization might be prevented.
@tim18, right! You can get even worse resource waste, keeping several processors busy calculating a thing that is idempotent.
Please don't post code with line numbers; it makes it tedious to try your solution.
Your code might produce undefined behaviour: multiplying two integers that are each in the range of int can yield a value greater than INT_MAX.
Your inner loop still has the same critical path as before. You have to wait for the previous iteration for the next calculation. You might have the result of your precalculated multiplication, but you still have to multiply and subtract in the critical path, which yields no benefit. The exception would be a MAC operation (see my answer).

The outer loop is just noise; it does the same thing every time, as Luis stated in his answer. So we don't have to consider it (or we could massively parallelize it, but doing it once would be enough)...

The inner loop depends on the previous iteration (x[i-1]).

A precomputation of z * y, changing the inner loop to x[i] = preComps[i] - z[i] * x[i - 1]; (as proposed by dvhh), might yield a benefit when a multiply-accumulate instruction is used, but this is implementation-specific and I don't know the gain of it.
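
As a hedged illustration of that multiply-accumulate idea, using C99's fma() from <math.h> (the function name is an assumption; whether the call maps to a single FMA instruction, and whether that actually shortens the critical path, is entirely compiler- and target-dependent):

#include <math.h>

/* x[i] = preComps[i] - z[i]*x[i-1], written as a fused multiply-add:
   fma(a, b, c) computes a*b + c with a single rounding step.          */
void compute_x_fma(int n, const double *z, double *x,
                   const double *preComps /* z[i]*y[i], filled beforehand */)
{
    int i;
    for (i = 1; i < n; i++)
        x[i] = fma(-z[i], x[i - 1], preComps[i]);
}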

If there are 0s in z[] then there is a possibility to parallelize, because z[i] = 0 implies x[i] = 0, which lets you split the inner loop at that point:

x[i+1] will then always be equal to z[i+1] * y[i+1], which is known immediately, giving you an entry point for another loop (a sketch follows below).
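
A minimal sketch of that splitting idea, assuming double arrays; the helper name and the segment bookkeeping are mine, it only pays off if z[] actually contains zeros, and each segment is still sequential inside.

#include <stdlib.h>
#include <omp.h>

void compute_x_split(int n, const double *z, const double *y, double *x)
{
    /* split points: index 0 (x[0] is given) plus every i where z[i] == 0,
       because there x[i] = 0 regardless of x[i-1]                          */
    int *start = malloc((size_t)n * sizeof *start);
    int nseg = 0, i, s;

    start[nseg++] = 0;
    for (i = 1; i < n; i++)
        if (z[i] == 0.0) {
            x[i] = 0.0;                 /* known immediately */
            start[nseg++] = i;
        }

    /* every segment begins at an already-known x value, so the segments
       are independent and can be handed to different threads             */
    #pragma omp parallel for schedule(dynamic)
    for (s = 0; s < nseg; s++) {
        int lo = start[s] + 1;
        int hi = (s + 1 < nseg) ? start[s + 1] : n;
        int j;
        for (j = lo; j < hi; j++)
            x[j] = z[j] * (y[j] - x[j - 1]);
    }

    free(start);
}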

3 Comments

Particularly in the original version, a declaration like float *__restrict x to assert no aliasing between x[] and y[] and z[] could help by permitting the loads of y and z to occur in parallel with computation of previous x values. As previously asserted, there is no evident value in OpenMP here.
@tim18, I have to check this, but the calculations in the inner loop depend only on the previous values of array x (y and z are only read), and neither x[0], y[0], nor z[0] is touched by the inner loop... So, probably, even in the case of aliasing, running the inner loop twice has no effect on the output. But, at least for the moment, that's only a hypothesis. I guess that only in the case of array overlapping can you get a difference.
@LuisColorado You are right about the outer loop; I misjudged the impact of x[i-1]. x[0] never changes, so the starting point is the same, and every other value in x is overwritten in order before it is read, so values from the previous pass never feed into the result. (Rewrote the comment as it had a few mistakes.)
