10

Today at school the teacher asked us to implement a duplicate-deletion algorithm. It's not that difficult, and everyone came up with the following solution (pseudocode):

for i from 1 to n - 1
    for j from i + 1 to n
        if v[i] == v[j] then remove(v, v[j])    // remove(from, what)
    next j
next i

The computational complexity for this algo is n(n-1)/2. (We're in high school, and we haven't talked about big-O, but it seems to be O(n^2)). This solution appears ugly and, of course, slow, so I tried to code something faster:

procedure binarySearch(vector, element, *position)
    // this procedure searches for element in vector, returning
    // true if found, false otherwise. *position will contain the
    // element's place (where it is or where it should be)
end procedure

----

// same type as v
vS = new array[n]

for i from 1 to n - 1
    if binarySearch(vS, v[i], &p) = true then
        remove(v, v[i])
    else
        add(vS, v[i], p)      // adds v[i] in position p of array vS
    end if
next i

This way vS will contain all the elements we've already passed. If element v[i] is in this array, then it is a duplicate and is removed. The computational complexity for the binary search is log(n) and for the main loop (second snippet) is n. Therefore the whole CC is n*log(n) if I'm not mistaken.

Then I had another idea about using a binary tree, but I can't put it down.
Basically my questions are:

  • Is my CC calculation right? (and, if not, why?)
  • Is there a faster method for this?

Thanks

5
  • 1
    Just for the record, this is indeed O(n^2). Commented May 20, 2011 at 12:35
  • What is the type of vS, and what does add do exactly? Commented May 20, 2011 at 12:42
  • @Robin Green: vS is like v, and add adds the specified element in the specified position Commented May 20, 2011 at 12:44
  • 1
    The complexity (not time/space but LOC) of the fast version depends on whether you are allowed to sort the array. If you are allowed to change the order (i.e. sort) it gets very simple. If you are not, you have to resort to a trick: sort the indexes and use those too look up duplicates. Commented May 20, 2011 at 15:02
  • @likao: clever way, I like it :) Commented May 20, 2011 at 15:33

7 Answers 7

15

The easiest solution will be to simply sort the array (takes O(n log n) with standard implementation if you may use them. Otherwise consider making an easy randomized quicksort (code is even on Wikipedia)).

Afterwards scan it for one additional time. During that scan simple eliminate consecutive identical elements.

If you want to do it in O(n), you can also use a HashSet with elements you have already seen. Just iterate once over your array, for each element check if it is in your HashSet.

If it isn't in there, add it. If it is in there, remove it from the array.

Note, that this will take some additional memory and the hashing will have a constant factor that contributes to your runtime. Although the time complexity is better, the practical runtime will only be only be faster once you exceed a certain array size

Sign up to request clarification or add additional context in comments.

4 Comments

Could you explain more deeply the hashset idea please?
@BlackBear: this is the same idea that Gumbo explained, a hash set is just a name of a hash table in which the key is also the value.
A HashSet is a data structure that supports insert and membership test in constant time. In your case you certainly don't want to implement such a data structure on your own but use and existing one for your programming languages instead. The set will allow adding keys and checking if they are already contained in the set. Since both operations are supported in constant time and you do 1 membership test (+ 1 insert or delete from your array) for each element, you end up with O(n). Note that this requires delete/remove to happen in constant time.
Except that operations on a HashSet are average case O(1). Worst case is O(n) (if you have a meshugganah hash function), so you can only guarantee O(n^2) for the whole algorithm.
6

You can often use a space-time tradeoff and invest more space to reduce time.

In this case you could use a hash table to determine the unique words.

3 Comments

+1 Great, I thought about this too, but couldn't find the hash function. Could you provide one, please?
@BlackBear: Many programming languages already have such a data structure that allows a mapping of keys onto values.
@BlackBear: don't worry about the hash function, most languages have one for the strings already built-in.
2

add is O(n), so your CC calculation is wrong. Your algorithm is O(n^2).

Moreover, how would remove be implemented? It also looks like it would be O(n) - so the initial algorithm would be O(n^3).

Comments

0

Binary search will only work if the array you're searching is sorted. I guess that's not the case here, or you wouldn't be looping over your entire array in the inner loop of the original solution.

2 Comments

The binary search is applied to vS, not v (which is the original array). I keep it sorted, inserting elements in their correct place.
@BlackBear: ah yes, I read it too fast ; ). In that case, it looks right to me, assuming vS can be initialised to contain values which are not in v
0

If the order of the final solution is irrelevant, you could break the array into smaller arrays based on length of the strings, and then remove duplicates from those arrays. Example:

// You have 
{"a", "ab", "b", "ab", "a", "c", "cd", "cd"}, 

// you break it into 
{"a", "b", "a", "c"} and {"ab", "ab", "cd", "cd"}, 

// remove duplicates from those arrays using the merge method that others have mentioned, 
// and then combine the arrays back together into 
{"a", "b", "c", "ab", "cd"}

Comments

0

This is the shortest algorithm that worked where arrNames and arrScores is parallel arrays and the highest score is taken.

I := 0;
J := 0;
//iCount being the length of the array

for I := 1 to iCount do
for J := I + 1 to iCount do

   if arrNames[I] = arrNames[J] then
   begin

     if arrScores[I] <= arrScores[J] then
     arrScores[I] := arrScores[J];

   arrScores[J] := arrScores[iCount];
   arrNames[J] := arrNames[iCount];

   arrScores[iCount] := 0;
   arrNames[iCount] := '';

   Dec(iCount);
   end;

Comments

0
def dedup(l):
    ht, et = [(None, None) for _ in range(len(l))], []
    for e in l:
        h, n = hash(e), h % len(ht)
        while True:
            if ht[n][0] is None:
                et.append(e)
                ht[n] = h, len(et) - 1
            if ht[n][0] == h and et[ht[n][1]] == e:
                break
            if (n := n + 1) == len(ht):
                n = 0
    return et

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.