
Suppose you are processing a large data set using several cores in parallel. I am looking for the most memory-efficient way to break up the data among the processors. Specifically, this would be for ARM processors on a Mac (Apple Silicon), and the algorithm is processing-light, memory-bound, and the data are independent of each other (e.g. simple statistics, like a histogram or an average.) Suppose there are n cores available, and so you are splitting the processing among n threads, and there are L items (bytes/ints/etc.) to process. There are several possibilities (pseudo-code is simplified and ignores edge cases, etc.):

  • Sectioned partitioning: Split the data into n blocks of size ~L/n, and have each thread/core process one block in parallel. Each thread k will have a loop looking like:
blockSize = L/n;
for (i = k*blockSize; i < (k+1)*blockSize; i++)
{
    // Process data[i]
}
  • Interleaved partitioning: Have each thread read the same data, with a stride of n items, each with a different offset, so each thread k reads and processes items k, k+n, k+2n, ... Each thread k will have a loop looking like:
for (i = k; i < L; i += n)
{
    // Process data[i]
}
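For concreteness, the two loops above can be fleshed out into one complete, runnable pthreads sketch that sums an array of bytes. The names here (task_t, parallel_sum, etc.) and the choice of n = 4 are mine for illustration; the sectioned variant gives the remainder to the last thread so L need not be a multiple of n:

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

enum { N_THREADS = 4 };   /* n */

typedef struct {
    const uint8_t *data;
    size_t         len;   /* L */
    int            k;     /* thread index, 0..n-1 */
    uint64_t       sum;   /* per-thread partial result */
} task_t;

/* Sectioned: thread k processes one contiguous block of ~L/n items. */
static void *sum_sectioned(void *arg)
{
    task_t *t = arg;
    size_t blockSize = t->len / N_THREADS;
    size_t begin = (size_t)t->k * blockSize;
    /* last thread also takes the remainder when L % n != 0 */
    size_t end = (t->k == N_THREADS - 1) ? t->len : begin + blockSize;
    uint64_t s = 0;
    for (size_t i = begin; i < end; i++)
        s += t->data[i];
    t->sum = s;
    return NULL;
}

/* Interleaved: thread k processes items k, k+n, k+2n, ... */
static void *sum_interleaved(void *arg)
{
    task_t *t = arg;
    uint64_t s = 0;
    for (size_t i = (size_t)t->k; i < t->len; i += N_THREADS)
        s += t->data[i];
    t->sum = s;
    return NULL;
}

/* Spawn n threads running fn, then combine their partial sums. */
uint64_t parallel_sum(const uint8_t *data, size_t len, void *(*fn)(void *))
{
    pthread_t th[N_THREADS];
    task_t tasks[N_THREADS];
    for (int k = 0; k < N_THREADS; k++) {
        tasks[k] = (task_t){ .data = data, .len = len, .k = k };
        pthread_create(&th[k], NULL, fn, &tasks[k]);
    }
    uint64_t total = 0;
    for (int k = 0; k < N_THREADS; k++) {
        pthread_join(th[k], NULL);
        total += tasks[k].sum;
    }
    return total;
}
```

Both variants compute the same result; only the order in which each core touches memory differs, which is exactly the variable in question.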

Now, Apple Silicon has a common memory and cache (L3), and each core has its own L1 and L2 cache. With sectioned partitioning, the central memory will have to feed n times more data at a time to the common cache, from n non-adjacent memory locations. With interleaved partitioning, each core will utilize only 1/n of the data read from the central cache, and therefore will need to be fed n times as much from the central cache.

There's also a more complex possibility, of which the above two are end cases:

  • Hybrid partitioning: Each thread/core processes a small amount of data at a time, between 1 item and L/n items, let's say (arbitrarily) 256 items. So each thread k will have a loop looking like:
chunkSize = 256;
assert (chunkSize >= 1 && chunkSize <= L/n);
for (i = k*chunkSize; i < L; i += n*chunkSize)
{
    for (j = 0; j < chunkSize; j++)
    {
        // process data[i+j]
    }
}

Here the L3 cache reads one stream from the main memory in sequence, and the L3 cache feeds separate streams to n cores at the same time.
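A runnable sketch of this hybrid scheme, for concreteness (the names hybrid_sum etc. and the choices n = 4, chunkSize = 256 are mine; the final partial chunk is clamped so L need not be a multiple of n*chunkSize):

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

enum { N_THREADS = 4, CHUNK = 256 };   /* n and chunkSize */

typedef struct {
    const uint8_t *data;
    size_t         len;   /* L */
    int            k;     /* thread index, 0..n-1 */
    uint64_t       sum;   /* per-thread partial result */
} task_t;

/* Hybrid: thread k takes chunks of CHUNK items, strided n*CHUNK apart,
 * so chunk number c = m*n + k is handled by thread k, and the n threads
 * collectively sweep memory as one near-sequential stream of chunks. */
static void *sum_hybrid(void *arg)
{
    task_t *t = arg;
    uint64_t s = 0;
    for (size_t i = (size_t)t->k * CHUNK; i < t->len;
         i += (size_t)N_THREADS * CHUNK) {
        size_t end = i + CHUNK;
        if (end > t->len)
            end = t->len;             /* clamp the final partial chunk */
        for (size_t j = i; j < end; j++)
            s += t->data[j];
    }
    t->sum = s;
    return NULL;
}

/* Spawn n hybrid workers and combine their partial sums. */
uint64_t hybrid_sum(const uint8_t *data, size_t len)
{
    pthread_t th[N_THREADS];
    task_t tasks[N_THREADS];
    for (int k = 0; k < N_THREADS; k++) {
        tasks[k] = (task_t){ .data = data, .len = len, .k = k };
        pthread_create(&th[k], NULL, sum_hybrid, &tasks[k]);
    }
    uint64_t total = 0;
    for (int k = 0; k < N_THREADS; k++) {
        pthread_join(th[k], NULL);
        total += tasks[k].sum;
    }
    return total;
}
```

Setting CHUNK to 1 recovers interleaved partitioning, and setting it to L/n recovers sectioned partitioning, so the same harness can be used to probe chunk sizes in between.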

Do any of the three schemes have a significant performance advantage over the others, with regards to the memory system?

  • "With sectioned partitioning, the central memory will have to feed n times more data at a time to the common cache, from n non-adjacent memory locations" Why? I do not think so. Besides, "M1", "M2" and "M3" are actually level 1 (L1), L2 and L3 cache, right? The latter is the standard platform-independent naming. Commented Aug 1 at 22:27
  • Yes, corrected the M's to L's. As to the question, n cores are reading from n separate ranges of memory, starting at p, p+L/n, p+2L/n, ... etc. Commented Aug 1 at 23:12
  • On Apple Silicon, each CPU has its own L1 caches (separate I and D), but L2 caches are shared across a core complex. On iPhones you would simply have one P and one E complex, but with Macs this has gotten more complicated. I think my M2 Ultra has 6 complexes in total (and an M2 Max would have 3), but I'm not entirely sure. But you'll probably have to tune mainly for DRAM loads anyway, which would require knowledge of the DRAM bank and row sizes... Commented Aug 2 at 4:55
  • You may have to figure it out empirically for each hardware architecture. That is how FFTW fine-tunes its cache-aware high-performance algorithm. It also depends a lot on how many writebacks to the main memory there will be by process_data. Sequential read-only access can be very fast, but the pattern of writes can seriously affect performance, with sequential write access being strongly favoured if you want maximum throughput. Nothing can beat doing the experiment. Commented Aug 2 at 7:49
  • There may still be a magic innermost chunk size where one partitioning method is faster. Historically it used to be a power-of-two length in bytes. But these days, with fancy associative cache hardware, it need not be. A first guess to probe would be 1/2, 1, and 2 times the L1 cache size. Commented Aug 3 at 7:46
