Note that regular quicksort favors slightly skewed distributions to achieve better branch prediction. This is not the case for BlockQuicksort: it really favors good pivots (I use up to 65 elements to find the pivots; from an information-theoretic point of view, median-of-O(sqrt(n)) would be optimal for BlockQuicksort).
Sorting takes \$O(n * log(n))\$ time in general. But there are corner cases like almost sorted lists or lists with a lot of duplicates. My implementation of BlockQuicksort switches to three-way quicksort (<, = and > partitions with a single pivot) when the sample I use to find the pivots has a lot of duplicates. This way a running time of \$O(n * log(m))\$ can be achieved, where m is the number of distinct values.
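A minimal sketch of such a three-way partition with a single pivot, assuming plain `int` elements. It is a textbook Dutch-national-flag loop, not the blocked implementation discussed here, and the names `partition3` and `quicksort3` are hypothetical:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: three-way partition of v[lo, hi) around `pivot`.
// Returns {lt, gt} with v[lo, lt) < pivot, v[lt, gt) == pivot, v[gt, hi) > pivot.
std::pair<std::size_t, std::size_t>
partition3(std::vector<int>& v, std::size_t lo, std::size_t hi, int pivot) {
    std::size_t lt = lo;  // end of the "< pivot" region
    std::size_t i = lo;   // next unclassified element
    std::size_t gt = hi;  // start of the "> pivot" region
    while (i < gt) {
        if (v[i] < pivot) {
            std::swap(v[lt], v[i]);
            ++lt;
            ++i;
        } else if (pivot < v[i]) {
            --gt;
            std::swap(v[i], v[gt]);  // the swapped-in element is classified next
        } else {
            ++i;  // equal to the pivot: stays in the middle region
        }
    }
    return {lt, gt};
}

// Recurse only into the "<" and ">" parts; all duplicates of the pivot are done.
void quicksort3(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo < 2) return;
    const int pivot = v[lo + (hi - lo) / 2];  // assumed pivot choice, for brevity
    const std::pair<std::size_t, std::size_t> r = partition3(v, lo, hi, pivot);
    quicksort3(v, lo, r.first);
    quicksort3(v, r.second, hi);
}
```

Because every element equal to the pivot is excluded from both recursive calls, an input with only m distinct values needs at most m levels of useful partitioning along any path.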
The standard qsort() interface has a major drawback when compared to C++ std::sort(): it uses function pointers. Calling them almost always results in a branch misprediction and is a huge slowdown. This bypasses the idea of BlockQuicksort to get rid of branches.
Sadly it's quite complex to get a general sorting routine in C without function pointers because of the lack of templates. So if you really need a high performance generic sorting routine in C you probably need to plug in some preprocessor tricks (one of the reasons I switched to C++).
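To make the difference concrete, here is a small sketch of the two interfaces side by side, using only `std::qsort` and `std::sort` from the standard library. The wrapper names `compare_ints`, `sort_with_qsort` and `sort_with_std_sort` are made up for illustration:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort-style comparator: reached through a function pointer on every comparison.
int compare_ints(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);  // avoids the overflow of x - y
}

void sort_with_qsort(std::vector<int>& v) {
    if (v.empty()) return;  // do not hand qsort a possibly null data pointer
    std::qsort(v.data(), v.size(), sizeof(int), compare_ints);
}

void sort_with_std_sort(std::vector<int>& v) {
    // The lambda's type is part of the template instantiation, so the
    // compiler can typically inline the comparison into the sort loop.
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });
}
```

Whether the indirect call really dominates depends on the compiler and the data, so it is worth measuring both variants on your own workload.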
5. Insertion sort
Insertion sort is also the finisher in my implementation. I use it as soon as the list is small enough (somewhere between 16 and 64 elements). This is simply because, when sorting pointers, the values are then still in cache from the last partition call.
Running it as a single pass at the very end instead results in fewer branches but requires loading everything another time, which isn't a big problem when sorting integers but is rather heavy for pointers.
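A sketch of the "finish immediately" variant, under stated assumptions: the cutoff `kSmall = 32` is just one value from the 16 to 64 range, `std::partition` stands in for the block partitioner, and the function names are hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed cutoff; the text only says "somewhere between 16 and 64 elements".
constexpr std::size_t kSmall = 32;

// Plain insertion sort on v[lo, hi).
void insertion_sort(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    for (std::size_t i = lo + 1; i < hi; ++i) {
        const int x = v[i];
        std::size_t j = i;
        while (j > lo && x < v[j - 1]) {
            v[j] = v[j - 1];  // shift larger elements one slot to the right
            --j;
        }
        v[j] = x;
    }
}

// Quicksort that sorts each small range right after the partition that
// produced it, while its elements are likely still in cache.
void quicksort_with_finisher(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    while (hi - lo > kSmall) {
        const int pivot = v[lo + (hi - lo) / 2];
        auto first = v.begin() + static_cast<std::ptrdiff_t>(lo);
        auto last = v.begin() + static_cast<std::ptrdiff_t>(hi);
        // Stand-in for the real partitioner: split into <, ==, > the pivot.
        auto mid1 = std::partition(first, last, [pivot](int x) { return x < pivot; });
        auto mid2 = std::partition(mid1, last, [pivot](int x) { return !(pivot < x); });
        quicksort_with_finisher(v, lo, static_cast<std::size_t>(mid1 - v.begin()));
        lo = static_cast<std::size_t>(mid2 - v.begin());  // continue with the right part
    }
    insertion_sort(v, lo, hi);
}
```

The alternative described above would skip the `insertion_sort` call inside the recursion and run one insertion sort over the whole array at the end.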
6. Sorting networks
I use sorting networks to find the pivots, which is really fast, but I have yet to find a way that outperforms insertion sort when sorting pointers. This is due to the compare-and-swap operation being way slower when the values to be compared differ from the values being swapped (conditional move + xor swap doesn't work).
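As an illustration, here is a three-element sorting network used to pick a median, plus the pointer variant of compare-and-swap where the compared values (`*p`, `*q`) are not the swapped values (`p`, `q`). This is a sketch with hypothetical names (`cswap`, `median3`, `cswap_ptr`), not the 65-element pivot selection mentioned above:

```cpp
#include <algorithm>

// Compare-and-swap on plain ints: afterwards a <= b. Compilers typically
// turn the min/max pair into conditional moves instead of a branch.
inline void cswap(int& a, int& b) {
    const int lo = std::min(a, b);
    const int hi = std::max(a, b);
    a = lo;
    b = hi;
}

// Three-element sorting network (0,1), (1,2), (0,1); returns the median.
inline int median3(int a, int b, int c) {
    cswap(a, b);
    cswap(b, c);
    cswap(a, b);
    return b;
}

// Pointer variant: the comparison needs two loads, and the result selects
// between the pointers rather than between the values just compared.
inline void cswap_ptr(const int*& p, const int*& q) {
    const bool out_of_order = *q < *p;
    const int* lo = out_of_order ? q : p;
    const int* hi = out_of_order ? p : q;
    p = lo;
    q = hi;
}
```

The pointer version has to dereference before it can select, which is the extra cost the paragraph above refers to.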
7. Heap sort
Another performance improvement really was to switch to heap sort when the recursion gets too deep. I don't know why I ran into such a scenario, but adding heapsort made my sort around 3-4% faster in my tests.
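A sketch of the depth-limited fallback, in the style of introsort. The depth budget of roughly 2*log2(n) is an assumption (the text does not give the limit used), `std::partition` again stands in for the block partitioner, and the function names are hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sort v[lo, hi); fall back to heap sort once the depth budget is used up.
void sort_limited(std::vector<int>& v, std::size_t lo, std::size_t hi, int depth_left) {
    if (hi - lo < 2) return;
    auto first = v.begin() + static_cast<std::ptrdiff_t>(lo);
    auto last = v.begin() + static_cast<std::ptrdiff_t>(hi);
    if (depth_left == 0) {
        // Heap sort via the standard heap algorithms: O(n log n) worst case.
        std::make_heap(first, last);
        std::sort_heap(first, last);
        return;
    }
    const int pivot = v[lo + (hi - lo) / 2];
    auto mid1 = std::partition(first, last, [pivot](int x) { return x < pivot; });
    auto mid2 = std::partition(mid1, last, [pivot](int x) { return !(pivot < x); });
    sort_limited(v, lo, static_cast<std::size_t>(mid1 - v.begin()), depth_left - 1);
    sort_limited(v, static_cast<std::size_t>(mid2 - v.begin()), hi, depth_left - 1);
}

void sort_with_fallback(std::vector<int>& v) {
    int depth = 0;
    for (std::size_t n = v.size(); n > 1; n >>= 1) depth += 2;  // about 2*log2(n), assumed
    sort_limited(v, 0, v.size(), depth);
}
```

Since the recursion depth is bounded by the budget, the stack usage stays logarithmic in n regardless of how bad the pivots are.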
8. Stack usage
I found using recursive calls to be faster than using a separate stack, so you should probably test both. Stack overflow is not a problem if you limit recursion by switching to heap sort after a limited number of recursions. Sadly I don't understand exactly why, but from looking at the assembly it seems the compiler had problems optimizing the separate stack.
Rather special info:
My library also differs from the standard std::sort() interface for a similar reason: the comparison function. When sorting pointers, a comparison function has to load both arguments and compare them. That is not necessary with quicksort: the pivot can sometimes be stored in a register or on the stack, and only one element needs to be loaded. That's why I use an index() function which gets the value to be partitioned (e.g. the pointer) and returns the value which is actually compared (e.g. a tuple to sort structs, or the value pointed to). This also allows me to call the index function only once per partitioned element and compare the result multiple times against the pivots. This way I trade L1 reads for register accesses.
This example shows that usage is less error prone. It doesn't match what people learn about sorting, and therefore I wouldn't recommend it in a general library. If a comparison function is really needed, then one can implement an index function that returns a special type which overloads the comparison operators.
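To illustrate the idea of an index function (this is not the library's actual interface; `partition_by_index` and `sort_by_index` are hypothetical names), the following sketch extracts the key once per element and keeps the pivot key in a local variable:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Three-way partition of v[lo, hi), hi > lo, comparing index(element)
// instead of calling a two-argument comparison function.
template <class T, class Index>
std::pair<std::size_t, std::size_t>
partition_by_index(std::vector<T>& v, std::size_t lo, std::size_t hi, Index index) {
    // The pivot key is computed once and can stay in a register.
    const auto pivot = index(v[lo + (hi - lo) / 2]);
    std::size_t lt = lo, i = lo, gt = hi;
    while (i < gt) {
        const auto key = index(v[i]);  // one index call per inspected element
        if (key < pivot) {
            std::swap(v[lt], v[i]);
            ++lt;
            ++i;
        } else if (pivot < key) {
            --gt;
            std::swap(v[i], v[gt]);
        } else {
            ++i;
        }
    }
    return {lt, gt};
}

template <class T, class Index>
void sort_by_index(std::vector<T>& v, std::size_t lo, std::size_t hi, Index index) {
    if (hi - lo < 2) return;
    const std::pair<std::size_t, std::size_t> r = partition_by_index(v, lo, hi, index);
    sort_by_index(v, lo, r.first, index);
    sort_by_index(v, r.second, hi, index);
}
```

Sorting a `std::vector<const int*>` by pointee would then pass `[](const int* p) { return *p; }` as the index function. For structs, the index function could return a `std::tuple` of the relevant fields, since tuples already compare lexicographically.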