3

I need to sort around 100000 strings ASCIIbetically and by length, I sort by lengths by putting it into a 2D vector at the length of the string and then sort each array using quicksort (for ASCIIbetically). But is there a faster sort for strings of equal length? I've heard radix is great but I find it difficult to understand. What would be the best way to sort equal length strings without using the sort() function? If you need the code I can post it up.

11
  • "ASCIIbetically" - nice one :-) Commented Sep 9, 2014 at 3:12
  • 1
    It's a thing, I swears it =] Commented Sep 9, 2014 at 3:23
  • I don't get what you want to achieve. Maybe a short example would help. Commented Sep 9, 2014 at 5:22
  • Okay so we are given about 100000 strings and we have to sort these strings n times in the fastest way possible. So say, 'the piece of puzzle Was higher' will end up being 'of Was the piece higher puzzle'. I'm just looking for a faster way to sort strings in this manner. Commented Sep 9, 2014 at 6:37
  • 1
    Have you read this article: Engineering Radix Sort for Strings ? It makes a lot of effort to optimize the radix sorting for strings. Commented Sep 9, 2014 at 9:01

2 Answers 2

2

I think building a trie and then retrieving the keys in the trie by means of pre-order traversal is about as efficient as it gets for string sorting, and is actually a form of radix sort. Here is a detailed academic paper discussing this method. In 2006 at least this was the fastest string sorting method out there.

Sign up to request clarification or add additional context in comments.

3 Comments

At the moment i'm getting around 0.38 seconds for sorting around 70000 strings using vectors and quicksort. Would a trie beat that? (then again it depends on the computer)
Sorry, extra digit 0.38 seconds
Does anyone have some simple example code for trie that I can look at, even pseudocode?
1

For strings of between 8 and 15 characters, your comparison function for quick sort could do the first 8 characters in a single 64-bit chunk. And so on for 16 to 31, etc. So, you end up with as many comparison functions as you feel makes a difference. Unless you have a very large number of strings with long common sub strings, just using what you know about the string lengths may do the trick, straightforwardly.

For completeness, you need to worry about alignment and byte order. So, fetching 8 bytes at a time into a uint64_t:

  uint64_t u ;

  memcpy(&u, pv, 8) ;
  ...convert to big-endian if required...

will do the trick. I can tell you that with gcc and -O2 on an x86_64 the memcpy() compiles to a single instruction, as if it was u = *(uint64_t*)pv :-) For processors with alignment issues, I would hope that the compiler does something suitable.

Sadly, memcmp(foo, bar, 8) does not get the same treatment (at least on gcc 4.8, not even with -O3) :-(

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.