I have a C++ number crunching program. The structure is:
a) data input, data preparation
b) "big" loop, uses global and local data (lots of different variables in both cases)
c) postprocess results and write data
The most intensive part is "b", which is basically a loop. I need to speedup the program in a cluster. 25 blades, 4 cores each. I wonder whether I could use here OpenMP and MPI, or if you can point me to tutorials, not general cases, but complex and "big" for loops.
Thanks