I have a Python application that needs to load the same large array (~4 GB) and apply an embarrassingly parallel function to chunks of this array. The array starts off saved to disk.
I typically run this application on a cluster with, say, 10 nodes, each of which has 8 compute cores and about 32 GB of RAM.
The easiest approach (which doesn't work) is to launch 80 MPI processes with mpi4py (e.g. `mpiexec -n 80`). It doesn't work because each MPI process loads its own copy of the 4 GB array, which exhausts the 32 GB of RAM on each node and raises a MemoryError.
An alternative is to have rank 0 alone load the 4 GB array and farm out chunks of it to the rest of the MPI ranks -- but this approach is slow because of network bandwidth.
The best approach would be for exactly one process on each node to load the 4 GB array and expose it as shared memory (through multiprocessing?) to the remaining 7 cores on that node.
How can I achieve this? How can I make MPI aware of the nodes and coordinate it with multiprocessing?