29

I have some performance critical code that involves sorting a very short fixed-length array with between around 3 and 10 elements in C++ (the parameter changes at compile time).

It occurred to me that a static sorting network specialised to each possible input size would perhaps be a very efficient way to do this: We do all the comparisons necessary to figure out which case we are in, then do the optimal number of swaps to sort the array.

To apply this, we use a bit of template magic to deduce the array length and apply the correct network:

#include <iostream>
using namespace std;

template< int K >
void static_sort(const double(&array)[K])
{
    cout << "General static sort\n" << endl;
}

template<>
void static_sort<3>(const double(&array)[3])
{
    cout << "Static sort for K=3" << endl;
}


int main()
{

    double  array[3];

    // performance critical code.
    // ...
    static_sort(array);
    // ...

}

Obviously it's quite a hassle to code all this up, so:

  • Does anyone have any opinions on whether or not this is worth the effort?
  • Does anyone know if this optimisation exists in any standard implementations of, for example, std::sort?
  • Is there an easy place to get hold of code implementing this kind of sorting network?
  • Perhaps it would be possible to generate a sorting network like this statically using template magic..

For now I just use insertion sort with a static template parameter (as above), in the hope that it will encourage unrolling and other compile-time optimisations.

Your thoughts welcome.


Update: I wrote some testing code to compare a 'static' insertion short and std::sort. (When I say static, I mean that the array size is fixed and deduced at compile time (presumably allowing loop unrolling etc). I get at least a 20% NET improvement (note that the generation is included in the timing). Platform: clang, OS X 10.9.

The code is here https://github.com/rosshemsley/static_sorting if you would like to compare it to your implementations of stdlib.

I have still yet to find a nice set of implementations for comparator network sorters.


12
  • what are the values you are sorting? Are they in any fixed range? Commented Nov 5, 2013 at 13:50
  • my values actually happen to be angles in [0,2pi). But I guess my idea was to focus on comparitor networks, so the value type shouldn't matter too much. Commented Nov 5, 2013 at 13:52
  • 5
    @RossHemsley, have you actually tried to see if the sort takes a significant amount of time on the execution of your program? Commented Nov 5, 2013 at 13:53
  • 1
    I remember there was a C sorting-network-generator on some internet page (unluckily can't remember/find it now). Used to benchmark this some two years or so ago, and found that upwards of 5 elements the difference to std::sort is rather small or pretty much non-existent. Anyway, just for testing out speed, using a code gen may be the easiest solution (if you don't want to do template stuff for academic purposes). This here is a similar website, it only outputs an array of pairs to compare/swap, but I think one can wrap that in C code in 5 mins. Commented Nov 5, 2013 at 13:58
  • 1
    @RossHemsley, well then you could be wasting your time. My suggestion is to just use std::sort, then run the program with a profiler and see how much time it spends in that function. What do you know, perhaps std::sort is also smart ;) The question is still interesting, of course. Commented Nov 5, 2013 at 14:02

4 Answers 4

23

Here is a little class that uses the Bose-Nelson algorithm to generate a sorting network on compile time.

/**
 * A Functor class to create a sort for fixed sized arrays/containers with a
 * compile time generated Bose-Nelson sorting network.
 * \tparam NumElements  The number of elements in the array or container to sort.
 * \tparam T            The element type.
 * \tparam Compare      A comparator functor class that returns true if lhs < rhs.
 */
template <unsigned NumElements, class Compare = void> class StaticSort
{
    template <class A, class C> struct Swap
    {
        template <class T> inline void s(T &v0, T &v1)
        {
            T t = Compare()(v0, v1) ? v0 : v1; // Min
            v1 = Compare()(v0, v1) ? v1 : v0; // Max
            v0 = t;
        }

        inline Swap(A &a, const int &i0, const int &i1) { s(a[i0], a[i1]); }
    };

    template <class A> struct Swap <A, void>
    {
        template <class T> inline void s(T &v0, T &v1)
        {
            // Explicitly code out the Min and Max to nudge the compiler
            // to generate branchless code.
            T t = v0 < v1 ? v0 : v1; // Min
            v1 = v0 < v1 ? v1 : v0; // Max
            v0 = t;
        }

        inline Swap(A &a, const int &i0, const int &i1) { s(a[i0], a[i1]); }
    };

    template <class A, class C, int I, int J, int X, int Y> struct PB
    {
        inline PB(A &a)
        {
            enum { L = X >> 1, M = (X & 1 ? Y : Y + 1) >> 1, IAddL = I + L, XSubL = X - L };
            PB<A, C, I, J, L, M> p0(a);
            PB<A, C, IAddL, J + M, XSubL, Y - M> p1(a);
            PB<A, C, IAddL, J, XSubL, M> p2(a);
        }
    };

    template <class A, class C, int I, int J> struct PB <A, C, I, J, 1, 1>
    {
        inline PB(A &a) { Swap<A, C> s(a, I - 1, J - 1); }
    };

    template <class A, class C, int I, int J> struct PB <A, C, I, J, 1, 2>
    {
        inline PB(A &a) { Swap<A, C> s0(a, I - 1, J); Swap<A, C> s1(a, I - 1, J - 1); }
    };

    template <class A, class C, int I, int J> struct PB <A, C, I, J, 2, 1>
    {
        inline PB(A &a) { Swap<A, C> s0(a, I - 1, J - 1); Swap<A, C> s1(a, I, J - 1); }
    };

    template <class A, class C, int I, int M, bool Stop = false> struct PS
    {
        inline PS(A &a)
        {
            enum { L = M >> 1, IAddL = I + L, MSubL = M - L};
            PS<A, C, I, L, (L <= 1)> ps0(a);
            PS<A, C, IAddL, MSubL, (MSubL <= 1)> ps1(a);
            PB<A, C, I, IAddL, L, MSubL> pb(a);
        }
    };

    template <class A, class C, int I, int M> struct PS <A, C, I, M, true>
    {
        inline PS(A &a) {}
    };

public:
    /**
     * Sorts the array/container arr.
     * \param  arr  The array/container to be sorted.
     */
    template <class Container> inline void operator() (Container &arr) const
    {
        PS<Container, Compare, 1, NumElements, (NumElements <= 1)> ps(arr);
    };

    /**
     * Sorts the array arr.
     * \param  arr  The array to be sorted.
     */
    template <class T> inline void operator() (T *arr) const
    {
        PS<T*, Compare, 1, NumElements, (NumElements <= 1)> ps(arr);
    };
};

#include <iostream>
#include <vector>

int main(int argc, const char * argv[])
{
    enum { NumValues = 32 };

    // Arrays
    {
        int rands[NumValues];
        for (int i = 0; i < NumValues; ++i) rands[i] = rand() % 100;
        std::cout << "Before Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
        StaticSort<NumValues> staticSort;
        staticSort(rands);
        std::cout << "After Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
    }

    std::cout << "\n";

    // STL Vector
    {
        std::vector<int> rands(NumValues);
        for (int i = 0; i < NumValues; ++i) rands[i] = rand() % 100;
        std::cout << "Before Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
        StaticSort<NumValues> staticSort;
        staticSort(rands);
        std::cout << "After Sort: \t";
        for (int i = 0; i < NumValues; ++i) std::cout << rands[i] << " ";
        std::cout << "\n";
    }

    return 0;
}

Benchmarks

The following benchmarks are compiled with clang -O3 and ran on my mid-2012 macbook air.

Time (in milliseconds) to sort 1 million arrays.
The number of milliseconds for arrays of size 2, 4, 8 are 1.943, 8.655, 20.246 respectively.
C++ Templated Bose-Nelson Static Sort timings

Here are the average clocks per sort for small arrays of 6 elements. The benchmark code and examples can be found at this question:
Fastest sort of fixed length 6 int array

Direct call to qsort library function   : 342.26
Naive implementation (insertion sort)   : 136.76
Insertion Sort (Daniel Stutzbach)       : 101.37
Insertion Sort Unrolled                 : 110.27
Rank Order                              : 90.88
Rank Order with registers               : 90.29
Sorting Networks (Daniel Stutzbach)     : 93.66
Sorting Networks (Paul R)               : 31.54
Sorting Networks 12 with Fast Swap      : 32.06
Sorting Networks 12 reordered Swap      : 29.74
Reordered Sorting Network w/ fast swap  : 25.28
Templated Sorting Network (this class)  : 25.01 

It performs as fast as the fastest example in the question for 6 elements.

The code used for the benchmarks can be found here.

It includes more features and further optimizations for more robust performance on real-world data.

Sign up to request clarification or add additional context in comments.

2 Comments

Used it and I'm not facing fast sorting really for 32 std::pair<int, int>. But nice clean code, thanks.
@amirali If it is an array of uint32_t pairs, convert the pairs into uint64_t, sort the array, and then convert back. That will allow the static sort to use the min/max instructions.
11

The other answers are interesting and fairly good, but I believe that I can provide some additional elements of answer, point per point:

  • Is it worth the effort? Well, if you need to sort small collections of integers and the sorting networks are tuned to take advantage of some instructions as much as possible, it might be worth the effort. The following graph presents the results of sorting a million arrays of int of size 0-14 with different sorting algorithms. As you can see, the sorting networks can provide a significant speedup if you really need it.

  • No standard implementation of std::sort I know of use sorting networks; when they are not fine-tuned, they might be slower than a straight insertion sort. libc++'s std::sort has dedicated algorithms to sort 0 thru 5 values at once but they it doesn't use sorting networks either. The only sorting algorithm I know of which uses sorting networks to sort a few values is Wikisort. That said, the research paper Applying Sorting Networks to Synthesize Optimized Sorting Libraries suggests that sorting networks could be used to sort small arrays or to improve recursive sorting algorithms such as quicksort, but only if they are fine-tuned to take advantage of specific hardware instructions.

    The access aligned sort algorithm is some kind of bottom-up mergesort that apparently uses bitonic sorting networks implemented with SIMD instructions for the first pass. Apparently, the algorithm could be faster than the standard library one for some scalar types.

  • I can actually provide such information for the simple reason that I developed a C++14 sorting library that happens to provide efficient sorting networks of size 0 thru 32 that implement the optimizations described in the previous section. I used it to generate the graph in the first section. I am still working on the sorting networks part of the library to provide size-optimal, depth-optimal and swaps-optimal networks. Small optimal sorting networks are found with brute force while bigger sorting networks use results from the litterature.

    Note that none of the sorting algorithms in the library directly use sorting networks, but you can adapt them so that a sorting network will be picked whenever the sorting algorithm is given a small std::array or a small fixed-size C array:

    using namespace cppsort;
    
    // Sorters are function objects that can be
    // adapted with sorter adapters from the
    // library
    using sorter = small_array_adapter<
        std_sorter,
        sorting_network_sorter
    >;
    
    // Now you can use it as a function
    sorter sort;
    
    // Instead of a size-agnostic sorting algorithm,
    // sort will use an optimal sorting network for
    // 5 inputs since the bound of the array can be
    // deduced at compile time
    int arr[] = { 2, 4, 7, 9, 3 };
    sort(arr);
    

    As mentioned above, the library provides efficient sorting networks for built-in integers, but you're probably out of luck if you need to sort small arrays of something else (e.g. my latest benchmarks show that they are not better than a straight insertion sort even for long long int).

  • You could probably use template metaprogramming to generate sorting networks of any size, but no known algorithm can generate the best sorting networks, so you might as well write the best ones by hand. I don't think the ones generated by simple algorithms can actually provide usable and efficient networks anyway (Batcher's odd-even sort and pairwise sorting networks might be the only usable ones) [Another answer seems to show that generated networks could actually work].

Comments

8

There are known optimal or at least best length comparator networks for N<16, so there's at least a fairly good starting point. Fairly, since the optimal networks are not necessarily designed for maximum level of parallelism achievable with e.g. SSE or other vector arithmetics.

Another point is that already some optimal networks for some N are degenerate versions for a slightly larger optimal network for N+1.

From wikipedia:

The optimal depths for up to 10 inputs are known and they are respectively 0, 1, 3, 3, 5, 5, 6, 6, 7, 7.

This said, I'd pursuit for implementing networks for N={4, 6, 8 and 10}, since the depth constraint cannot be simulated by extra parallelism (I think). I also think, that the ability to work in registers of SSE (also using some min/max instructions) or even some relatively large register set in RISC architecture will provide noticeable performance advantage compared to "well known" sorting methods such as quicksort due to absence of pointer arithmetic and other overhead.

Additionally, I'd pursuit to implement the parallel network using the infamous loop unrolling trick Duff's device.

EDIT When the input values are known to be positive IEEE-754 floats or doubles, it's also worth to mention that the comparison can also be performed as integers. (float and int must have same endianness)

8 Comments

Interesting note about float and int. I had never noticed the less-than relation on the bit pattern is the same for both float and int interpretation. I have to verify this.
Duff's device is a clever way to use switch to jump into an unrolled loop at the desired starting point. You can just plain unroll a loop without using Duff's device. Maybe I'm missing something here, but it sounds like you're using "Duff's device" as a synonym for unrolling (which is incorrect).
I probably meant using switch case with fall through, but you are right--there's no need to jump to a starting point and loop. A fall through would just reuse code (if even possible) of different networks.
@PeterCordes Duff’s device encapsulating a loop with a switch that jumps to a certain part of it, allowing the loop to be artificially unrolled. The advantage over complete unrolling is it’s friendlier on the uop cache and LSD; the disadvantage is (often) worse assembly gen as compilers bail out of a lot of optimizations when they see Duff’s device.
The sane alternative to Duff's device is a startup and/or cleanup loop (using scalar and/or one vector at a time) which still takes some extra code size, but is far short of fully unrolling. The advantage is that it lets the actual loop be fully optimized, not needing to support multiple entry points, which could require a compiler to use multiple separate pointer-increment instructions. And would defeat auto-vectorization. (The original Duff's device was targeting a machine with a pre- or post-increment addressing mode so there wasn't a benefit to doing one ADD every 8 out isns)
The usecase for Duff’s device is when you can only guarantee loop unrolling offset from a condition that can’t be known in advance. E.g. it’s 100% safe and portable to read past the end of buffers in C as long as you don’t cross a page boundary (and as long as ASAN isn’t on.) You can use duffs device to jump into the middle of an unrolled string operation based on the offset from the page so that you only have to do extraneous edge case checks once every few thousand bytes instead of once every byte. Yes, optimization bail-out is a big disadvantage of Duff’s device, as I mentioned
I can't picture a sane C or asm loop that matches your description. Unrolling by 4096, with absolutely massive code size? You could instead have an inner loop unrolled by something sane like 8, and an outer loop. On the first iteration of the outer/inner loop, enter the inner loop with an offset such that it will end at the end of a page. Or just a startup loop to get to a multiple of 16 bytes, and have a single loop that processes an aligned 16-byte chunk, like for SIMD-vectorized strlen. (Harder with multiple vectors that can be relatively misaligned.)
you're right unrolling by a full 4096 bytes would be insane but (1.) compilers are very finicky and (2.) hitting the uop cache is more vital than anything at cold entry to functions. It can be very hard to get compilers to obey properly how you want them to compile setup/entry code that gets up to the desired alignment (to avoid a nasty rats' maze of excessive branching). I'm not disagreeing with anything you're saying and I personally haven't had a usecase yet for Duff's device. However, I am prepared to use Duff's device if it's ever the right tool for the job.
3

Let me share some thoughts.

Does anyone have any opinions on whether or not this is worth the effort?

It is impossible to give a correct answer. You have to profile your actual code to find that out. In my practice, when it comes to low-level profiling, the bottleneck was always not where I thought.

Does anyone know if this optimisation exists in any standard implementations of, for example, std::sort?

For example, Visual C++ implementation of std::sort uses insertion sort for small vectors. I'm not aware of an implementation which uses optimal sorting networks.

Perhaps it would be possible to generate a sorting network like this statically using template magic

There are algorithms for generating sorting networks, such as Bose-Nelson, Hibbard, and Batcher's algorithms. As C++ templates are Turing-complete, you can implement them using TMP. However, those algorithms are not guaranteed to give the theoretically minimal number of comparators, so you may want to hardcode the optimal network.

1 Comment

1)I asked for opinions since work involved is potentially considerable before any testing is possible. 2) I know std::sort often uses insertion sort for small inputs. But the question is do any implementations consider fixed length arrays or comparator networks? 3) I am fine with the template code, I want the comparator networks! 4) TMP is (usually) Turing complete, and so if there exists an algorithm to generate comparator networks, it is possible. But probably horrible. The question is there for interest's sake ;)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.