Generating data to test sorting algorithm

Question

I would like to generate data to test sorting algorithms with. This accomplishes two things:

Find bugs. The output could easily be checked if it was in fact sorted correctly
Profile the code and find which situations take longer for which parts.

I asked the question How do you test speed of sorting algorithm? awhile ago, but this question focuses particularly on generating the data.

I am thinking of

sorted
reverse sorted
random
sorted but then make n inversions in randomly selected elements and see how changing n affects the run time

Any suggestions? Do any frameworks exist that would make this easier? I'm thinking JUnit could be useful.

In this question on comp sci se an answer makes it sound like adding inversions and counting them doesn't mean much:

The number of inversions might work for some cases, but is sometimes insufficient. An example given in [3] is the sequence

$$\langle \lfloor n/2 \rfloor + 1, \lfloor n/2 \rfloor + 2, \ldots, n, 1, \ldots, \lfloor n/2 \rfloor \rangle$$

that has a quadratic number of inversions, but only consists of two ascending runs. It is nearly sorted, but this is not captured by inversions.

I'm not particularly strong in math and don't understand how the example illustrates what's wrong with counting the number of inversions? Is it just academic? How does it make sense to say "quadratic number of inversions"?

My unscientific observations: Real data that needs to be sorted is usually either (listed here in no particular order): a) effectively random (because the order that the data was generated from is based on a different key), b) already sorted, c) already sorted in reverse order, or d) mostly sorted with a (relatively) few items out of place. And invariably, real data that needs to be sorted has lots of duplicate keys - N can outnumber K by many orders of magnitude. — 500 - Internal Server Error
– 500 - Internal Server Error, Commented Oct 22, 2016 at 1:16
@500-InternalServerError you seem to have experience, may I ask where from? Is there any online sources of free sample data? — Celeritas
– Celeritas, Commented Oct 22, 2016 at 5:56
Also try reverse sorted with a few inversions. To model adding data to an existing sorted file try, say, 90% sorted with 10% random records at the end. — rossum
– rossum, Commented Nov 2, 2016 at 12:16

rcgldr · Accepted Answer · 2016-10-22 04:04:41Z

0

Using integer math, the $$...$$ sequence can represent an array:

    1      2      n/2              n    indices
n/2+1, n/2+2, ...,  n, 1, 2, ... n/2    array values

So as stated, just two ascending sequences.

By the definition of inversion, two elements a[i] and a[j] form an inversion if a[i] > a[j] and i < j. This means that all of the first n/2 elements of a, a[1 to n/2] are greater than all of the second n/2 elements of a, a[(n/2)+1 to n]. So that's (n/2)^2 = n^2/4 inversions which is quadratic.

The relationship between inversion count and sort time complexity depends on the sort algorithm. Using bubble sort on the example array would have time complexity O(n^2). Using generic merge sort on the array would be O(n log(n)), with an near best case comparison count. Using natural merge sort would find the two sorted runs and do a single merge pass for time complexity of O(n).

edited Oct 22, 2016 at 4:04

answered Oct 22, 2016 at 2:14

rcgldr

29.4k3 gold badges41 silver badges70 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Generating data to test sorting algorithm

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related