The word "parallelizing" sounds magical, but scheduling rules still apply:
The problem is not the effort spent on converting a pure SEQ-process into its PAR-re-representation, but the costs of doing so, if you indeed insist on going PAR at any cost.
m = numel(A); %% number of elements ( size() would return a whole size-vector )
%{
+---+---+---+---+---+---+---+---+---+---+---+---+---+ .. +---+
const A[] := | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | | M |
+---+---+---+---+---+---+---+---+---+---+---+---+---+ .. +---+
:
\
\
\
\
\
\
\
\
\
\
\
+---+---+---+---+---+ .. + .. +---+---+---+---+---+ .. +---+
var B[] := | 0 | 0 | 0 | 0 | 0 | : | 0 | 0 | 0 | 0 | 0 | | 0 |
+---+---+---+---+---+ .. : .. +---+---+---+---+---+ .. +---+
%}
%% : ^ :
%% : | :
for n = 1:m %% : | :
B(n+1) =( %% ====:===+ : .STO NEXT n+1
%% : :
%% v :
B(n)^2 %% : { FMA B, B, .GET LAST n ( in SEQ :: OK, local data, ALWAYS )
+ B(n) %% v B } ( in PAR :: non-local data. CSP + bcast + many distributed-caches invalidates )
+ B(n) * A(n) %% { FMA B, A,
+ A(n)^k               %%        ApK}              ( k .. some a-priori known, constant exponent )
);
end
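The loop above can be sketched in plain Python ( a minimal, runnable translation of the MATLAB-style snippet; the exponent k is not defined in the original, so a fixed k = 2 is assumed here just for illustration ):

```python
def recur(A, k=2):
    """Serial recurrence: each B[n+1] needs the already-finished B[n]."""
    m = len(A)
    B = [0.0] * (m + 1)        # var B[] := all zeros, one cell longer than A
    for n in range(m):         # strictly SEQ -- no two iterations can overlap
        B[n + 1] = B[n]**2 + B[n] + B[n] * A[n] + A[n]**k
    return B
```

No iteration can start before its predecessor has finished, which is exactly the dependency the "crossroads" in the diagram above depicts.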
Once the SEQ-process data-dependency is recurrent ( having a need to re-use the LAST B(n) for the assignment of the NEXT B(n+1) ), any attempt to make such a SEQ calculation work in PAR has to introduce a system-wide communication of the known values. A "new" value can get computed only after the respective "previous" B(n) has been evaluated and assigned -- through the purely serial SEQ chain of recurrent evaluation, thus not before all the previous cells have been processed serially -- as the LAST piece is always needed for the NEXT step ( ref. the "crossroads" in the for()-loop iterator dependency-map above ). Having this, all the rest have to wait in a "queue", to become able to do their two primitive .FMA-s + .STO the result for the next one in the recurrency-dictated "queue".
Yes, one can "enforce" the formula to become PAR-executed, but the very costs of communicating the LAST values "across" the PAR-execution fabric ( towards the NEXT ) are typically prohibitively expensive ( in terms of resources and accrued delays -- either damaging the SIMT-optimised scheduler's latency-masking, or blocking all the threads until they receive the "neighbour"-assigned LAST-value that they rely on and cannot proceed without -- either of which effectively devastates any potential benefit from all the efforts invested into going PAR ).
Even just a pair of FMA-s is not enough code to justify the add-on costs of all the PAR efforts -- it is indeed an extremely small amount of work to do.
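To put this "too little work" into rough numbers ( the figures below are assumed orders of magnitude, not measurements ): a pair of FMA-s takes about a nanosecond on a contemporary core, whereas communicating the LAST value across a PAR-execution fabric costs on the order of microseconds:

```python
fma_pair_s  = 2 * 0.5e-9   # ~2 FMA-s at ~0.5 ns each     ( assumed, illustrative )
broadcast_s = 1.0e-6       # ~1 us to propagate the LAST value ( assumed, illustrative )

# per-step cost ratio: time spent moving the value vs. time spent computing with it
overhead_ratio = broadcast_s / fma_pair_s   # the fabric communicates ~1000x longer than it computes
```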
Unless some very mathematically "dense" processing is in place, all the additional costs do not get easily amortised, and such an attempt to introduce a PAR-mode of computing exhibits nothing but a negative ( adverse ) effect, instead of any wished-for speedup. One ought, in all professional cases, quantify all the add-on costs during the Proof-of-Concept phase ( a PoC ), before deciding whether any feasible PAR-approach is possible at all, and how to achieve a speedup of >> 1.0 x
Relying on just the advertised theoretical GFLOPS and TFLOPS is nonsense. Your actual GPU-kernel will never be able to repeat the advertised benchmarks' performance figures ( unless you run exactly the same optimised layout and code, which one does not need, does one? ). One typically needs to compute one's own specific algorithmisation, related to one's own problem domain, without being willing to artificially align all the toy-problem elements just so that the GPU-silicon will not have to wait for real data and can enjoy some tweaked cache/register-based ILP-artifacts, practically not achievable in most real-world problem solutions. If there is one step to recommend -- always evaluate an overhead-fair PoC first, to see whether any such chance for a speedup exists at all, before sinking any resources and investing time and money into prototyping a detailed design & testing.
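Such an overhead-fair PoC can start as simply as an overhead-strict re-formulation of Amdahl's Law ( a sketch; the function and its parameter names are mine, and the numbers are only illustrative -- substitute your own measured times ):

```python
def net_speedup(T1_s, n_workers, overhead_s, p=1.0):
    """Overhead-strict speedup estimate: add-on costs get paid whether or not PAR helps.

    T1_s       -- measured pure-SEQ run time [s]
    n_workers  -- number of PAR workers
    overhead_s -- all add-on costs [s] ( transfers, spawns, syncs, ... )
    p          -- parallelisable fraction of T1_s
    """
    T_par = overhead_s + (1.0 - p) * T1_s + p * T1_s / n_workers
    return T1_s / T_par

# e.g. 10 ms of perfectly parallelisable work on 1024 workers, but 5 ms of add-on costs:
# the "infinite horsepower" still cannot deliver even a 2x speedup
speedup = net_speedup(10e-3, 1024, 5e-3)
```

The moment overhead_s dominates, no amount of extra workers can push the result much above 1.0 x, which is exactly what the PoC is there to reveal before any money gets spent.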
Recurrent and computationally weak GPU kernel-payloads will in almost every case fight hard just to get their additional overhead-times ( the bidirectional data-transfers ( H2D + D2H ) plus the kernel-code related loads ) amortised at all.
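A quick way to see whether those H2D + D2H overheads can ever get amortised is to compare them with the pure-SEQ run time they must at least repay ( a sketch with assumed, illustrative PCIe-like bandwidth and latency figures -- substitute your own measured values ):

```python
def transfer_s(n_bytes, bw_Bps=12e9, latency_s=10e-6):
    # one host<->device hop: a fixed latency plus bytes over sustained bandwidth
    return latency_s + n_bytes / bw_Bps

def gpu_break_even_s(h2d_bytes, d2h_bytes, kernel_s):
    # the pure-SEQ CPU run must take at least this long, or the GPU path loses outright
    return transfer_s(h2d_bytes) + transfer_s(d2h_bytes) + kernel_s

# e.g. shipping 8 MB each way around a 1 ms kernel:
# the CPU-side SEQ run must need more than ~2.35 ms before the GPU can even break even
threshold_s = gpu_break_even_s(8e6, 8e6, 1e-3)
```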