How to choose an efficient algorithm for a large database permutation search?

Question

I am given the task of writing Java code to read in lines from a database (of which there can be 100s of thousands of lines) and returning the sum of permutations of each line in the database. For example, if there are 3 lines as such:

ABCD
BACD
DCAB

Then the output should be 3 (ABCD == BACD, ABCD == DCAB, BACD == DCAB) where == means 'is a permutation'.

An obvious answer would be (n = words in dictionary, m = characters in word):

Loop through database (O(n))
On each line: (O(1))
1. Sort in order (O(mlogm))
2. Add to array (O(1))
Loop through array (i = 0 : n) (O(n))
1. Loop through array (j = i + 1 : n) (O(n))
  1. If array[i] == array[j] then count++ (O(1))
Return count (O(1))

Total complexity:

= O(n) * (O(1) * (O(mlogm) + O(1))) + O(n) * (O(n) * O(1)) + O(1)
= O(n) * (O(mlogm)) + O(n) * (O(n)) + O(1)
= O(nmlogm) + O(n^2) + O(1)
= O(nmlogm + n^2)
= O(n^2) assuming n >> m

Of course, this is anything but efficient for a large database, so I'd like to see if there are better algorithms. I have thought of using a hash table, but not exactly sure of how to implement it. So, on each line read maybe record the number of occurrences of each character, but then I've only seen implementations of seeing if two words are permutations of each other this way, not n words - this would require more than 1 hash table I believe).

Some information: Not all words are the same length. No information on the distribution is known (it's a randomly 'sorted' database).

Can someone provide some suggestions?

No actual code please.

You are essentially looking for anagrams in a dictionary. You can use the sorted lines as a key for a hash map that contains a list of lines, e.g. perm["inp"] = ["nip", "pin"]. — M Oehm
– M Oehm, Commented Mar 15, 2014 at 10:54
What's that n in step 3)? What's n in step 1)? It seems you attach the same name to different things here. — pentadecagon
– pentadecagon, Commented Mar 15, 2014 at 12:52
Does the dictionary consist of repeated words? If yes, are they considered permutations of each other? — chettyharish
– chettyharish, Commented Mar 15, 2014 at 22:14
@MOehm If I use a hash table, there's no way of me retrieving all of the data after insertion (unless I remember the keys). — BeccaLorna
– BeccaLorna, Commented Mar 16, 2014 at 7:12
@chettyharish It can do, and yes they would be considered permutations of each other. — BeccaLorna
– BeccaLorna, Commented Mar 16, 2014 at 7:13

pentadecagon · Accepted Answer · 2014-03-15 13:07:28Z

0

Keep doing 1 through 4, which take time proportional to the number of rows in your database. But before looking for duplicates you sort the array. Sorting is fast. After sorting you find duplicates next to each other, so in order to look for duplicates you just have to compare each entry with its neighbors. If your database is really big you can do all those operations on the database itself, but if it has merely 100s of thousands of lines you can do that in memory. Altogether this shouldn't take more than a couple of seconds.

answered Mar 15, 2014 at 13:07

pentadecagon

4,8972 gold badges21 silver badges26 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

BeccaLorna Over a year ago

So you recommend to insert each sorted word into the array (O(nmlogm) for n words, m characters), then sort the array (O(nlogn)) and loop through for duplicates (O(n))? Disregarding implementation complexity, is this the most efficient solution for this problem?

pentadecagon Over a year ago

This is what I would do. And FWIW, I would just skip that complexity analysis you keep doing, because understanding complexity is hard. You got it wrong the first time, and you got it wrong again: sorting the array takes O(nlogn) comparisons, but each comparison is O(m) so sorting the array has a complexity of O(mnlogn). But keep in mind, this complexity thing is meant for very large numbers, otherwise constant factors become much more important. Is your m very large? Bottom line, just implement it, and check the performance. It's simple, maybe a couple of hours.

BeccaLorna Over a year ago

We want to sort n words of m characters, how can that be O(mnlogn)? You sort each word (O(mlogm)) n times. This is for an algorithm exercise and I am trying to find the most efficient solution in terms of complexity which is why I keep doing the analysis.

pentadecagon Over a year ago

There are two sorting operations here: First, you sort each individual word of size m. Then you sort the entire array of size n. The former is O(nm log m). The latter O(nm log n ).

chettyharish · Accepted Answer · 2014-03-16 07:36:54Z

The best possible way to do such a problem quickly is by using hash tables. Now as a dictionary will only consist of a limited set of characters (generally 52 for each English Character) All you have to do is :

Generate the hash values of all possible characters and store it in some table. (Using hash function with avalanche effect such as SHA -2. The avalanche effect ensures that the collision probability is very low ). Now though this step might be high in complexity (depends on the internal working of SHA -2 implementation by your language), It only has to be carried out for a fixed number of times so its complexity is O(1). (You should note that O(1) doesn't mean its the fastest, it may take few seconds for hashing.)
Now go through all the words one at a time and just add the hashes of individual characters and store it as line hash value. O(nx) where x is the average length of words. Again to be noted that x will be generally a small number such as 6-7 making the function O(7n) === O(n)
Now just go through the line hash values, if two line hash values are same, then they are a permutation. Again a O(n) step.

O() notation is actually not the best way to measure the quickness of an algorithm on finer granularity. A better option is the ~ notation.

So my program will take O(n) time complexity & O(n) space complexity. (The space would be constant for the hashing table in step 1, and would be equal to number of lines in step 2 thus O(n)).

Now for all the skeptics who believe that two different words might have the same hash line value total, The probability of that happening is lower than a meteor hitting your laptop.

I hope this helps you

The above implementation should be simple, there are better ways than this

Collectives™ on Stack Overflow

How to choose an efficient algorithm for a large database permutation search?

2 Answers 2

4 Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related