Best algorithm for delete duplicates in array of strings

Question

Today at school the teacher asked us to implement a duplicate-deletion algorithm. It's not that difficult, and everyone came up with the following solution (pseudocode):

for i from 1 to n - 1
    for j from i + 1 to n
        if v[i] == v[j] then remove(v, v[j])    // remove(from, what)
    next j
next i

The computational complexity for this algo is n(n-1)/2. (We're in high school, and we haven't talked about big-O, but it seems to be O(n^2)). This solution appears ugly and, of course, slow, so I tried to code something faster:

procedure binarySearch(vector, element, *position)
    // this procedure searches for element in vector, returning
    // true if found, false otherwise. *position will contain the
    // element's place (where it is or where it should be)
end procedure

----

// same type as v
vS = new array[n]

for i from 1 to n - 1
    if binarySearch(vS, v[i], &p) = true then
        remove(v, v[i])
    else
        add(vS, v[i], p)      // adds v[i] in position p of array vS
    end if
next i

This way vS will contain all the elements we've already passed. If element v[i] is in this array, then it is a duplicate and is removed. The computational complexity for the binary search is log(n) and for the main loop (second snippet) is n. Therefore the whole CC is n*log(n) if I'm not mistaken.

Then I had another idea about using a binary tree, but I can't put it down.
Basically my questions are:

Is my CC calculation right? (and, if not, why?)
Is there a faster method for this?

Thanks

@Robin Green: vS is like v, and add adds the specified element in the specified position — BlackBear
– BlackBear, Commented May 20, 2011 at 12:44
The complexity (not time/space but LOC) of the fast version depends on whether you are allowed to sort the array. If you are allowed to change the order (i.e. sort) it gets very simple. If you are not, you have to resort to a trick: sort the indexes and use those too look up duplicates. — LiKao
– LiKao, Commented May 20, 2011 at 15:02

stephenarosaj · Accepted Answer · 2024-07-02 14:17:22Z

15

The easiest solution will be to simply sort the array (takes O(n log n) with standard implementation if you may use them. Otherwise consider making an easy randomized quicksort (code is even on Wikipedia)).

Afterwards scan it for one additional time. During that scan simple eliminate consecutive identical elements.

If you want to do it in O(n), you can also use a HashSet with elements you have already seen. Just iterate once over your array, for each element check if it is in your HashSet.

If it isn't in there, add it. If it is in there, remove it from the array.

Note, that this will take some additional memory and the hashing will have a constant factor that contributes to your runtime. Although the time complexity is better, the practical runtime will only be only be faster once you exceed a certain array size

edited Jul 2, 2024 at 14:17

stephenarosaj

206 bronze badges

answered May 20, 2011 at 12:48

b.buchhold

3,9062 gold badges29 silver badges34 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

BlackBear Over a year ago

Could you explain more deeply the hashset idea please?

Matthieu M. Over a year ago

@BlackBear: this is the same idea that Gumbo explained, a hash set is just a name of a hash table in which the key is also the value.

b.buchhold Over a year ago

A HashSet is a data structure that supports insert and membership test in constant time. In your case you certainly don't want to implement such a data structure on your own but use and existing one for your programming languages instead. The set will allow adding keys and checking if they are already contained in the set. Since both operations are supported in constant time and you do 1 membership test (+ 1 insert or delete from your array) for each element, you end up with O(n). Note that this requires delete/remove to happen in constant time.

Tripp Kinetics Over a year ago

Except that operations on a HashSet are average case O(1). Worst case is O(n) (if you have a meshugganah hash function), so you can only guarantee O(n^2) for the whole algorithm.

Gumbo · Accepted Answer · 2011-05-20 12:47:07Z

6

You can often use a space-time tradeoff and invest more space to reduce time.

In this case you could use a hash table to determine the unique words.

answered May 20, 2011 at 12:47

Gumbo

657k112 gold badges792 silver badges852 bronze badges

3 Comments

BlackBear Over a year ago

+1 Great, I thought about this too, but couldn't find the hash function. Could you provide one, please?

Gumbo Over a year ago

@BlackBear: Many programming languages already have such a data structure that allows a mapping of keys onto values.

Matthieu M. Over a year ago

@BlackBear: don't worry about the hash function, most languages have one for the strings already built-in.

Robin Green · Accepted Answer · 2011-05-20 12:50:53Z

2

add is O(n), so your CC calculation is wrong. Your algorithm is O(n^2).

Moreover, how would remove be implemented? It also looks like it would be O(n) - so the initial algorithm would be O(n^3).

answered May 20, 2011 at 12:50

Robin Green

33.4k20 gold badges112 silver badges197 bronze badges

Comments

OpenSauce · Accepted Answer · 2011-05-20 12:40:35Z

0

Binary search will only work if the array you're searching is sorted. I guess that's not the case here, or you wouldn't be looping over your entire array in the inner loop of the original solution.

answered May 20, 2011 at 12:40

OpenSauce

8,6731 gold badge26 silver badges30 bronze badges

2 Comments

BlackBear Over a year ago

The binary search is applied to vS, not v (which is the original array). I keep it sorted, inserting elements in their correct place.

OpenSauce Over a year ago

@BlackBear: ah yes, I read it too fast ; ). In that case, it looks right to me, assuming vS can be initialised to contain values which are not in v

gung - Reinstate Monica · Accepted Answer · 2017-07-20 15:39:20Z

0

If the order of the final solution is irrelevant, you could break the array into smaller arrays based on length of the strings, and then remove duplicates from those arrays. Example:

// You have 
{"a", "ab", "b", "ab", "a", "c", "cd", "cd"}, 

// you break it into 
{"a", "b", "a", "c"} and {"ab", "ab", "cd", "cd"}, 

// remove duplicates from those arrays using the merge method that others have mentioned, 
// and then combine the arrays back together into 
{"a", "b", "c", "ab", "cd"}

edited Jul 20, 2017 at 15:39

gung - Reinstate Monica

12k8 gold badges64 silver badges80 bronze badges

answered Jul 20, 2017 at 15:14

its_me

112 bronze badges

Comments

Rahul Agarwal · Accepted Answer · 2018-10-15 18:04:39Z

0

This is the shortest algorithm that worked where arrNames and arrScores is parallel arrays and the highest score is taken.

I := 0;
J := 0;
//iCount being the length of the array

for I := 1 to iCount do
for J := I + 1 to iCount do

   if arrNames[I] = arrNames[J] then
   begin

     if arrScores[I] <= arrScores[J] then
     arrScores[I] := arrScores[J];

   arrScores[J] := arrScores[iCount];
   arrNames[J] := arrNames[iCount];

   arrScores[iCount] := 0;
   arrNames[iCount] := '';

   Dec(iCount);
   end;

edited Oct 15, 2018 at 18:04

Rahul Agarwal

4,1168 gold badges33 silver badges56 bronze badges

answered Oct 15, 2018 at 17:13

Verushan Naidoo

1

Comments

ccie18643 · Accepted Answer · 2021-01-10 19:58:31Z

0

def dedup(l):
    ht, et = [(None, None) for _ in range(len(l))], []
    for e in l:
        h, n = hash(e), h % len(ht)
        while True:
            if ht[n][0] is None:
                et.append(e)
                ht[n] = h, len(et) - 1
            if ht[n][0] == h and et[ht[n][1]] == e:
                break
            if (n := n + 1) == len(ht):
                n = 0
    return et

edited Jan 10, 2021 at 19:58

answered Jan 10, 2021 at 19:21

ccie18643

354 bronze badges

Collectives™ on Stack Overflow

Best algorithm for delete duplicates in array of strings

7 Answers 7

4 Comments

3 Comments

Comments

2 Comments

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

7 Answers 7

4 Comments

3 Comments

Comments

2 Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related