A merge of sorted arrays can be split into n independent, parallelizable tasks. The general idea is that you pick values from the input data, and each thread merges only the subarray elements that fall between two consecutive picked values. The steps are:
- Pick n - 1 values from the unsorted input data
- Merge sort normally until the subarrays are big enough that it's worth merging them in multiple threads.
- For each subarray you're merging and for each value you picked, find the index of the last occurrence of that value if it exists, or the index of the first value greater than it if it does not.
- Sort each list of indexes, then prepend the beginning of the subarray (index 0) and append its end (the subarray's length).
- You can now merge, for each i, the slice of each subarray from index ind[i], inclusive, up to index ind[i+1], exclusive, and write that result to disk independently of the rest of the merging (a sketch of the index computation follows).
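Here's a minimal C++ sketch of the index computation in the last two steps, for a single subarray. The `split_points` name is mine, and I use `std::lower_bound` (first element >= the pick) rather than an explicit last-occurrence search; elements equal to a pick may then land on the other side of a cut, but since they're equal the final result is still sorted.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: for one sorted subarray, turn the picked values
// into slice boundaries. Sorting the picks first guarantees the resulting
// index list is already sorted, since std::lower_bound is monotone.
template <typename T>
std::vector<std::size_t> split_points(const std::vector<T>& v,
                                      std::vector<T> picks) {
    std::sort(picks.begin(), picks.end());
    std::vector<std::size_t> ind;
    ind.push_back(0);                        // beginning of the subarray
    for (const T& p : picks)
        ind.push_back(std::lower_bound(v.begin(), v.end(), p) - v.begin());
    ind.push_back(v.size());                 // end of the subarray
    return ind;
}
```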
Here's a concrete example:
unsorted input: [a,w,d,y,u,p,l,s,c,e,g,h]
pick values: g and p
subarrays in last step of mergesort: a = [a,d,p,u,w,y] and b = [c,e,g,h,l,s]
indexes of picked values (or next highest) in a: a_ind=[2, 2]
indexes of picked values (or next highest) in b: b_ind=[2, 5]
add ends of array to indexes: a_ind=[0,2,2,6], b_ind=[0,2,5,6]
merge each of these in separate threads:
[a,d] with [c,e]
[] with [g,h,l]
[p,u,w,y] with [s]
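And here's a hedged sketch of the per-thread merging for this example, with the slice boundaries hard-coded to the a_ind/b_ind values above (they're what `split_points` from the earlier sketch would compute for picks {g, p}); an in-memory `out` buffer stands in for the output file:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

int main() {
    std::vector<char> a{'a','d','p','u','w','y'};
    std::vector<char> b{'c','e','g','h','l','s'};

    // Slice boundaries from the example above.
    std::vector<std::size_t> ai{0, 2, 2, 6};
    std::vector<std::size_t> bi{0, 2, 5, 6};

    std::vector<char> out(a.size() + b.size());
    std::vector<std::thread> pool;
    for (std::size_t i = 0; i + 1 < ai.size(); ++i) {
        pool.emplace_back([&, i] {
            // Slice i of the output starts at ai[i] + bi[i], so every
            // task writes to a disjoint region and needs no locking.
            std::merge(a.begin() + ai[i], a.begin() + ai[i + 1],
                       b.begin() + bi[i], b.begin() + bi[i + 1],
                       out.begin() + ai[i] + bi[i]);
        });
    }
    for (auto& t : pool) t.join();
    assert(std::is_sorted(out.begin(), out.end()));
}
```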
This example is far too small to benefit, but in my experience, if the subarrays are large and you choose n carefully, this approach can provide good parallelization and faster sorts.
You can write the result from a single thread to the output file before the other threads finish, because you know that slice i of the merged output starts at offset a_ind[i] + b_ind[i] and you know how many values that thread is merging.
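For instance, on POSIX systems `pwrite` lets each thread write its finished slice at that offset without coordinating seeks with the others. `write_slice` is a hypothetical helper, and this assumes fixed-size, trivially copyable elements:

```cpp
#include <unistd.h>   // pwrite (POSIX)
#include <cstddef>
#include <vector>

// Hypothetical helper: flush one finished slice without waiting for the
// others. Slice i starts at element offset ai[i] + bi[i] and holds
// (ai[i+1] - ai[i]) + (bi[i+1] - bi[i]) elements, so every thread writes
// to a disjoint byte range of the file.
template <typename T>  // T must be trivially copyable
void write_slice(int fd, const std::vector<T>& merged_slice,
                 std::size_t ai_start, std::size_t bi_start) {
    pwrite(fd, merged_slice.data(),
           merged_slice.size() * sizeof(T),
           static_cast<off_t>((ai_start + bi_start) * sizeof(T)));
}
```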
One challenge is dividing the work evenly. I picked the values randomly, and picked quite a few more of them than I had cores; that worked well for distributing the work evenly across threads.
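A sampling step along those lines might look like the following; `pick_values` is a hypothetical name, and duplicate picks are harmless since they just produce empty slices:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: sample k split values uniformly at random from
// the unsorted input. Choosing k to be several times the core count
// tends to even out slice sizes across threads.
template <typename T>
std::vector<T> pick_values(const std::vector<T>& input, std::size_t k) {
    std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> dist(0, input.size() - 1);
    std::vector<T> picks;
    picks.reserve(k);
    for (std::size_t i = 0; i < k; ++i)
        picks.push_back(input[dist(gen)]);
    return picks;
}
```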
This is really just a generalization of @chqrlie's answer, but I hope it might help someone.