
Note that regular quicksort favors slightly skewed distributions to achieve better branch prediction. This is not the case for BlockQuicksort: it really favors good pivots (I use up to 65 elements to find the pivots; from an information-theoretic standpoint, median-of-O(sqrt(n)) would be optimal for BlockQuicksort).

Sorting takes \$O(n * log(n))\$ time in general. But there are corner cases like almost-sorted lists or lists with a lot of duplicates. My implementation of BlockQuicksort switches to three-way quicksort (<, = and > partitions with a single pivot) when the sample I use to find the pivots has a lot of duplicates. This way a performance of \$O(n * log(m))\$ can be achieved, where m is the number of distinct values.
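The three-way split can be sketched as a single Dutch-national-flag pass; `three_way_partition` and its int-only signature are illustrative assumptions, not the routine from the library:

```cpp
#include <utility>
#include <vector>

// Sketch of a three-way (<, =, >) partition around a single pivot.
// Returns the bounds [lt, gt] of the run equal to the pivot; equal
// elements never enter the recursion again, which is what yields
// O(n * log(m)) for m distinct values.
std::pair<int, int> three_way_partition(std::vector<int>& v, int lo, int hi, int pivot) {
    int lt = lo, i = lo, gt = hi;
    while (i <= gt) {
        if (v[i] < pivot)       std::swap(v[lt++], v[i++]);   // grow the "<" run
        else if (v[i] > pivot)  std::swap(v[i], v[gt--]);     // grow the ">" run
        else                    ++i;                          // extend the "=" run
    }
    return std::make_pair(lt, gt);
}
```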

The standard qsort() interface has a major drawback compared to C++ std::sort(): it uses function pointers. Calling them almost always results in a branch misprediction and is a huge slowdown. This defeats the whole idea of BlockQuicksort, which is to get rid of branches.

Sadly it's quite complex to get a generic sorting routine in C without function pointers because of the lack of templates. So if you really need a high-performance generic sorting routine in C you probably need to plug in some preprocessor tricks (one of the reasons I switched to C++).
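To illustrate the function-pointer cost: with qsort() every comparison goes through an opaque indirect call, while std::sort() bakes the comparator's type into the template instantiation so the compare can be inlined. The wrapper names `sort_with_qsort` / `sort_with_std` are made up for this comparison:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort: the comparator is an opaque function pointer, so every
// comparison is an indirect call that can't be inlined and is easily
// mispredicted.
int cmp_int(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

void sort_with_qsort(std::vector<int>& v) {
    std::qsort(v.data(), v.size(), sizeof(int), cmp_int);
}

// std::sort: the lambda's type is part of the template instantiation,
// so the comparison can be inlined into the sorting loop.
void sort_with_std(std::vector<int>& v) {
    std::sort(v.begin(), v.end(), [](int x, int y) { return x < y; });
}
```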

5. Insertion sort

Insertion sort is also the finisher in my implementation. I use it as soon as the list is small enough (somewhere between 16-64 elements). This is simply because when sorting pointers this results in the values still being in cache from the last partition call.

Using it afterwards instead results in fewer branches, but requires loading everything a second time, which isn't a big problem when sorting integers but is rather heavy for pointers.
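The finisher pattern can be sketched as below; `insertion_sort` and the cutoff value are illustrative (the post only says "somewhere between 16-64 elements"):

```cpp
#include <vector>

// Insertion sort used as the quicksort finisher: run on each partition
// as soon as it is small enough, while its elements are still hot in
// cache from the last partition call.
const int kSmallCutoff = 32;  // illustrative value from the 16-64 range

void insertion_sort(std::vector<int>& v, int lo, int hi) {  // sorts v[lo..hi]
    for (int i = lo + 1; i <= hi; ++i) {
        int key = v[i];
        int j = i - 1;
        while (j >= lo && v[j] > key) {  // shift bigger elements one to the right
            v[j + 1] = v[j];
            --j;
        }
        v[j + 1] = key;
    }
}
```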

6. Sorting networks

I use sorting networks to find the pivots, which is really fast, but I have yet to find a way that outperforms insertion sort when sorting pointers. This is due to the compare-and-swap operation being way slower when the values to be compared differ from the values being swapped (conditional move + xor swap doesn't work).
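For plain values a compare-and-swap compiles down to min/max or conditional moves, which is what makes networks fast; once the key has to be loaded through a pointer while the pointers themselves are swapped, this branchless form no longer applies. `cas` and `sort3` are illustrative names:

```cpp
#include <algorithm>

// Branchless compare-and-swap: for plain integers the compiler can
// emit min/max or conditional moves, with no unpredictable branch.
inline void cas(int& a, int& b) {
    int lo = std::min(a, b);
    int hi = std::max(a, b);
    a = lo;
    b = hi;
}

// 3-element sorting network: after these comparators a <= m <= b,
// so m is the median -- handy for pivot selection.
inline void sort3(int& a, int& m, int& b) {
    cas(a, m);  // a <= m
    cas(a, b);  // a is now the minimum
    cas(m, b);  // m <= b
}
```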

7. Heap sort

Another real performance improvement was to switch to heap sort on deep recursion. I don't know why I ran into such deep recursions, but adding heap sort made my sort around 3-4% faster in my tests.

8. Stack usage

I found using recursive calls to be faster than using a separate stack, so you should probably test both. Stack overflow is not a problem if you limit recursion by switching to heap sort after a fixed number of recursions. Sadly I don't understand exactly why, but from looking at the assembly it seems the compiler had problems optimizing the separate stack.
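The depth-limited fallback described above is the classic introsort pattern; this is a sketch under that assumption, not the author's exact code. It recurses on one part, loops on the other, and finishes with heap sort once the depth budget runs out, bounding both stack depth and worst-case time:

```cpp
#include <algorithm>
#include <vector>

// Introsort-style guard: plain quicksort, but fall back to heap sort
// once the recursion depth budget is exhausted.
void sort_limited(std::vector<int>::iterator lo, std::vector<int>::iterator hi,
                  int depth_left) {
    while (hi - lo > 1) {
        if (depth_left-- == 0) {              // too deep: finish with heap sort
            std::make_heap(lo, hi);
            std::sort_heap(lo, hi);
            return;
        }
        int pivot = *(lo + (hi - lo) / 2);
        auto mid1 = std::partition(lo, hi, [&](int x) { return x < pivot; });
        auto mid2 = std::partition(mid1, hi, [&](int x) { return x == pivot; });
        sort_limited(lo, mid1, depth_left);   // recurse on the "<" part
        lo = mid2;                            // loop on the ">" part
    }
}
```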


Rather special info:

My library also differs from the standard std::sort() interface in the comparison function. When sorting pointers, a comparison function has to load both arguments and compare them. That is not necessary with quicksort: the pivot can sometimes be stored in a register or on the stack, and only one element needs to be loaded. That's why I use an index() function which gets the value currently to be partitioned (e.g. the pointer) and returns the value to actually be compared (e.g. a tuple to sort structs, or the value pointed to). This also allows me to call the index function only once per partitioned element and compare it multiple times against the pivots. This results in less than 0.7 * n * log2(n) calls to index(), compared to the average 2 * n * log2(n) comparison calls for randomized quicksort. This way I trade L1 reads for register accesses.
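A minimal sketch of the index() idea for sorting pointers by pointee value; `index_of` and `partition_by_index` are hypothetical names, not the library's API. The pivot's key is loaded once into a local (ideally a register), and every other element is loaded exactly once:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical index(): maps the element being partitioned (a pointer)
// to the key that is actually compared (the pointee).
inline int index_of(const int* p) { return *p; }

std::vector<const int*>::iterator
partition_by_index(std::vector<const int*>& v, const int* pivot_ptr) {
    int pivot_key = index_of(pivot_ptr);       // one load, then kept local
    auto first = v.begin(), last = v.end();
    while (first != last) {
        if (index_of(*first) < pivot_key) {    // one index() call per element
            ++first;
        } else {
            std::iter_swap(first, --last);     // the swapped-in element gets
        }                                      // indexed on the next round
    }
    return first;                              // [begin, first) have key < pivot_key
}
```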

It doesn't match what people learn about sorting, and therefore I wouldn't recommend it in a general library. If a comparison function is really needed, one can implement an index function that returns a special type which overloads the comparison operators.


Number of scanned elements in the whole sort on average:
  3 pivots: 1.385 * n * log2(n) 
  5 pivots: 1.379 * n * log2(n)

Number of copied elements in the whole sort on average:
  3 pivots: 1.108 * n * log2(n) 
  5 pivots: 1.248 * n * log2(n)

In my implementation I use a heuristic to choose between copying key & value together + BlockQuicksort and three-pivot quicksort when sorting pointers to integers, but this is specially tuned towards my SACA. Generally copying + BlockQuicksort is favored for bigger lists.


pdqsort nicely tackles not-so-random permutations. I greatly encourage you to look at the source code on GitHub for info and benchmark data (it also uses BlockQuicksort). It is able to detect partially sorted subsequences and sometimes sort them in \$O(n)\$ time.




// Example for sorting a struct, e.g. points:
struct sortme_t { int x, y; };

std::vector<sortme_t> elements(1024);
std::generate(elements.begin(), elements.end(), []() {
  sortme_t element;
  element.x = std::rand();
  element.y = std::rand();
  return element;
});

// with comparison:
std::sort(elements.begin(), elements.end(), [](auto a, auto b) {
   return std::tie(a.x, a.y) < std::tie(b.x, b.y);
   // this would be a simple bug:
   // return std::tie(a.x, a.y) < std::tie(b.y, b.x);
});

// with index:
sort_index(elements.begin(), elements.end(), [](auto a) {
   return std::tie(a.x, a.y);
});

This example shows that usage is less error prone: with a plain comparison function it's easy to mix up the two arguments (see the commented-out bug above), which can't happen with the index function. It doesn't match what people learn about sorting though, so I wouldn't recommend it in a general library. If a comparison function is really needed, one can implement an index function that returns a special type which overloads the comparison operators.

Finally, note that some of these tips are strictly tuned towards my application: I did a lot of benchmarking and optimization to find the best solutions for my problem of suffix array construction. I hope a lot of it is useful to you, because this post contains around 2-3 months of research and development. Feel free to ask for more info.

