The word "parallelizing" sounds magical, but scheduling rules still apply:
The problem is not the effort spent on converting a pure SEQ-process into its PAR-re-representation, but the costs of doing so, if you indeed insist on going PAR at any cost.
m = numel(A); %% number of elements ( size() would return a whole size-vector )
%{
+---+---+---+---+---+---+---+---+---+---+---+---+---+ .. +---+
const A[] := | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | | M |
+---+---+---+---+---+---+---+---+---+---+---+---+---+ .. +---+
:
\
\
\
\
\
\
\
\
\
\
\
+---+---+---+---+---+ .. + .. +---+---+---+---+---+ .. +---+
var B[] := | 0 | 0 | 0 | 0 | 0 | : | 0 | 0 | 0 | 0 | 0 | | 0 |
+---+---+---+---+---+ .. : .. +---+---+---+---+---+ .. +---+
%}
%% : ^ :
%% : | :
for n = 1:m %% : | :
B(n+1) =( %% ====:===+ : .STO NEXT n+1
%% : :
%% v :
B(n)^2 %% : { FMA B, B, .GET LAST n ( in SEQ :: OK, local data, ALWAYS )
+ B(n) %% v B } ( in PAR :: non-local data. CSP + bcast + many distributed-caches invalidates )
+ B(n) * A(n) %% { FMA B, A,
+ A(n)^k               %%        ApK}              ( k .. some a-priori known, constant exponent )
);
end
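The loop above can be sketched in plain Python ( a minimal, runnable translation of the MATLAB-style snippet; the exponent k is not defined in the original, so a fixed k = 2 is assumed here just for illustration ):

```python
def recur(A, k=2):
    """Serial recurrence: each B[n+1] needs the already-finished B[n]."""
    m = len(A)
    B = [0.0] * (m + 1)        # var B[] := all zeros, one cell longer than A
    for n in range(m):         # strictly SEQ -- no two iterations can overlap
        B[n + 1] = B[n]**2 + B[n] + B[n] * A[n] + A[n]**k
    return B
```

No iteration can start before its predecessor has finished, which is exactly the dependency the "crossroads" in the diagram above depicts.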
Once the SEQ-process data-dependency is recurrent ( having a need to re-use the LAST B(n) for the assignment of the NEXT B(n+1) ), any attempt to make such a SEQ calculation work in PAR has to introduce a system-wide communication of the known values. A "new" value can get computed only after the respective "previous" B(n) has been evaluated and assigned -- through the purely serial SEQ chain of recurrent evaluation, thus not before all the previous cells have been processed serially -- as the LAST piece is always needed for the NEXT step ( ref. the "crossroads" in the for()-loop iterator dependency-map above ). Having this, all the rest have to wait in a "queue", to become able to do their two primitive .FMA-s + .STO the result for the next one in the recurrency-dictated "queue".
Yes, one can "enforce" the formula to become PAR-executed, but the very costs of communicating the LAST values "across" the PAR-execution fabric ( towards the NEXT ) are typically prohibitively expensive ( in terms of resources and accrued delays -- either damaging the SIMT-optimised scheduler's latency-masking, or blocking all the threads until they receive the "neighbour"-assigned LAST-value that they rely on and cannot proceed without -- either of which effectively devastates any potential benefit from all the efforts invested into going PAR ).
Even just a pair of FMA-s is not enough code to justify the add-on costs of all the PAR efforts -- it is indeed an extremely small amount of work to do.
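To put this "too little work" into rough numbers ( the figures below are assumed orders of magnitude, not measurements ): a pair of FMA-s takes about a nanosecond on a contemporary core, whereas communicating the LAST value across a PAR-execution fabric costs on the order of microseconds:

```python
fma_pair_s  = 2 * 0.5e-9   # ~2 FMA-s at ~0.5 ns each     ( assumed, illustrative )
broadcast_s = 1.0e-6       # ~1 us to propagate the LAST value ( assumed, illustrative )

# per-step cost ratio: time spent moving the value vs. time spent computing with it
overhead_ratio = broadcast_s / fma_pair_s   # the fabric communicates ~1000x longer than it computes
```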
Unless some very mathematically "dense" processing is in place, all the additional costs do not get easily amortised, and such an attempt to introduce a PAR-mode of computing exhibits nothing but a negative ( adverse ) effect, instead of any wished-for speedup. One ought, in all professional cases, quantify all the add-on costs during the Proof-of-Concept phase ( a PoC ), before deciding whether any feasible PAR-approach is possible at all, and how to achieve a speedup of >> 1.0 x
Relying on just the advertised theoretical GFLOPS and TFLOPS is nonsense. Your actual GPU-kernel will never be able to repeat the advertised benchmarks' performance figures ( unless you run exactly the same optimised layout and code, which one does not need, does one? ). One typically needs to compute one's own specific algorithmisation, related to one's own problem domain, without being willing to artificially align all the toy-problem elements just so that the GPU-silicon will not have to wait for real data and can enjoy some tweaked cache/register-based ILP-artifacts, practically not achievable in most real-world problem solutions. If there is one step to recommend -- always evaluate an overhead-fair PoC first, to see whether any such chance for a speedup exists at all, before sinking any resources and investing time and money into prototyping a detailed design & testing.
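Such an overhead-fair PoC can start as simply as an overhead-strict re-formulation of Amdahl's Law ( a sketch; the function and its parameter names are mine, and the numbers are only illustrative -- substitute your own measured times ):

```python
def net_speedup(T1_s, n_workers, overhead_s, p=1.0):
    """Overhead-strict speedup estimate: add-on costs get paid whether or not PAR helps.

    T1_s       -- measured pure-SEQ run time [s]
    n_workers  -- number of PAR workers
    overhead_s -- all add-on costs [s] ( transfers, spawns, syncs, ... )
    p          -- parallelisable fraction of T1_s
    """
    T_par = overhead_s + (1.0 - p) * T1_s + p * T1_s / n_workers
    return T1_s / T_par

# e.g. 10 ms of perfectly parallelisable work on 1024 workers, but 5 ms of add-on costs:
# the "infinite horsepower" still cannot deliver even a 2x speedup
speedup = net_speedup(10e-3, 1024, 5e-3)
```

The moment overhead_s dominates, no amount of extra workers can push the result much above 1.0 x, which is exactly what the PoC is there to reveal before any money gets spent.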
Recurrent and computationally weak GPU kernel-payloads will in almost every case fight hard just to get their additional overhead-times ( the bidirectional data-transfers ( H2D + D2H ) plus the kernel-code related loads ) amortised at all.
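A quick way to see whether those H2D + D2H overheads can ever get amortised is to compare them with the pure-SEQ run time they must at least repay ( a sketch with assumed, illustrative PCIe-like bandwidth and latency figures -- substitute your own measured values ):

```python
def transfer_s(n_bytes, bw_Bps=12e9, latency_s=10e-6):
    # one host<->device hop: a fixed latency plus bytes over sustained bandwidth
    return latency_s + n_bytes / bw_Bps

def gpu_break_even_s(h2d_bytes, d2h_bytes, kernel_s):
    # the pure-SEQ CPU run must take at least this long, or the GPU path loses outright
    return transfer_s(h2d_bytes) + transfer_s(d2h_bytes) + kernel_s

# e.g. shipping 8 MB each way around a 1 ms kernel:
# the CPU-side SEQ run must need more than ~2.35 ms before the GPU can even break even
threshold_s = gpu_break_even_s(8e6, 8e6, 1e-3)
```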