I need to sort around 100000 strings ASCIIbetically and by length, I sort by lengths by putting it into a 2D vector at the length of the string and then sort each array using quicksort (for ASCIIbetically). But is there a faster sort for strings of equal length? I've heard radix is great but I find it difficult to understand. What would be the best way to sort equal length strings without using the sort() function? If you need the code I can post it up.
-
"ASCIIbetically" - nice one :-)paxdiablo– paxdiablo2014-09-09 03:12:55 +00:00Commented Sep 9, 2014 at 3:12
-
1It's a thing, I swears it =]Jackson Collins– Jackson Collins2014-09-09 03:23:01 +00:00Commented Sep 9, 2014 at 3:23
-
I don't get what you want to achieve. Maybe a short example would help.Henry– Henry2014-09-09 05:22:03 +00:00Commented Sep 9, 2014 at 5:22
-
Okay so we are given about 100000 strings and we have to sort these strings n times in the fastest way possible. So say, 'the piece of puzzle Was higher' will end up being 'of Was the piece higher puzzle'. I'm just looking for a faster way to sort strings in this manner.Jackson Collins– Jackson Collins2014-09-09 06:37:21 +00:00Commented Sep 9, 2014 at 6:37
-
1Have you read this article: Engineering Radix Sort for Strings ? It makes a lot of effort to optimize the radix sorting for strings.vgru– vgru2014-09-09 09:01:13 +00:00Commented Sep 9, 2014 at 9:01
2 Answers
I think building a trie and then retrieving the keys in the trie by means of pre-order traversal is about as efficient as it gets for string sorting, and is actually a form of radix sort. Here is a detailed academic paper discussing this method. In 2006 at least this was the fastest string sorting method out there.
3 Comments
For strings of between 8 and 15 characters, your comparison function for quick sort could do the first 8 characters in a single 64-bit chunk. And so on for 16 to 31, etc. So, you end up with as many comparison functions as you feel makes a difference. Unless you have a very large number of strings with long common sub strings, just using what you know about the string lengths may do the trick, straightforwardly.
For completeness, you need to worry about alignment and byte order. So, fetching 8 bytes at a time into a uint64_t:
uint64_t u ;
memcpy(&u, pv, 8) ;
...convert to big-endian if required...
will do the trick. I can tell you that with gcc and -O2 on an x86_64 the memcpy() compiles to a single instruction, as if it was u = *(uint64_t*)pv :-) For processors with alignment issues, I would hope that the compiler does something suitable.
Sadly, memcmp(foo, bar, 8) does not get the same treatment (at least on gcc 4.8, not even with -O3) :-(