Suppose you are processing a large data set using several cores in parallel. I am looking for the most memory-efficient way to break up the data among the processors.
Specifically, this would be for ARM processors on a Mac (Apple Silicon), and the algorithm is processing-light, memory-bound, and the data items are independent of each other (e.g. simple statistics, like a histogram or an average). Suppose there are n cores available, so you are splitting the processing among n threads, and there are L items (bytes/ints/etc.) to process. There are several possibilities (pseudo-code is simplified and ignores edge cases, etc.):
- Sectioned partitioning: Split the data into n blocks of size ~L/n, and have each thread/core process one block in parallel. Each thread k will have a loop looking like:
blockSize = L/n;
for (i = k*blockSize; i < (k+1)*blockSize; i++)
{
// Process data[i]
}
- Interleaved partitioning: Have each thread read the same data with a stride of n items, each with a different offset, so thread k reads and processes the k, k+n, k+2n, ...'th items. Each thread k will have a loop looking like:
for (i = k; i < L; i += n)
{
// Process data[i]
}
Now, Apple Silicon has a common memory and a shared cache (L3), and each core has its own L1 and L2 caches. With sectioned partitioning, the central memory has to feed n times more data at a time to the common cache, from n non-adjacent memory locations. With interleaved partitioning, each core utilizes only 1/n of the data read from the central cache, and therefore needs to be fed n times as much from the central cache.
There's also a more complex possibility, of which the above two are end cases:
- Hybrid partitioning: Each thread/core processes a small amount of data at a time, between 1 item and L/n items; let's say (arbitrarily) 256 items. Each thread k will have a loop looking like:
chunkSize = 256;
assert (chunkSize >= 1 && chunkSize <= L/n);
for (i = k*chunkSize; i < L; i += n*chunkSize)
{
for (j = 0; j < chunkSize; j++)
{
// process data[i+j]
}
}
Here the L3 cache reads one stream from main memory in sequence, and feeds separate streams to the n cores at the same time.
Do any of the three schemes have a significant performance advantage over the others, with regards to the memory system?
(With sectioned partitioning, the n cores are reading from n separate ranges of memory, starting at p, p + L/n, p + 2L/n, ... etc., where p is the base address of the data.)