
This question was motivated by another Stack Overflow question - How do I improve remove duplicate algorithm?

The requirement posted in that question was -

We need to return the length of an array after removing duplicates, but we can leave at most 2 occurrences of each element.

Example - for [1, 1, 1, 2, 2, 3], the new array would be [1, 1, 2, 2, 3], so the new length would be 5.

The solution given by the OP -

def removeDuplicates(nums):
    if nums is None:
        return 0
    if len(nums) == 0:
        return 0
    if len(nums) == 1:
        return 1
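    # First pass: count how many times each value appears.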
    new_array = {}
    for num in nums:
        new_array[num] = new_array.get(num, 0) + 1
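    # Second pass: cap each count at 2 and sum the capped counts.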
    new_length = 0
    for key in new_array:
        if new_array[key] > 2:
            new_length = new_length + 2
        else:
            new_length = new_length + new_array[key]
    return new_length

I tried coming up with a solution that reduces the number of loops to a single loop.

def removeDuplicates1(nums):
    if nums is None:
        return 0
    if len(nums) == 0:
        return 0
    if len(nums) == 1:
        return 1
    new_array = {}
    length = 0
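    # Single pass: count occurrences, and extend the length only for the first two occurrences of each value.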
    for num in nums:
        n = new_array.get(num, 0)
        new_array[num] = n + 1
        if n <= 1:
            length += 1
    return length

After that, I tried timing my solution against the original. I expected mine to show at least a small improvement, but timeit showed that the original solution was consistently faster (even when the array contained all unique elements). Timings taken -

In [3]: l = list(range(1000))

In [4]: %timeit removeDuplicates(l)
1000 loops, best of 3: 390 µs per loop

In [5]: %timeit removeDuplicates1(l)
1000 loops, best of 3: 412 µs per loop

In [6]: l1 = [1] * 1000

In [7]: %timeit removeDuplicates(l1)
1000 loops, best of 3: 224 µs per loop

In [9]: %timeit removeDuplicates1(l1)
1000 loops, best of 3: 304 µs per loop

Could someone please advise why this is happening? Am I overlooking something obvious?

  • Wouldn't you call that 'remove triplicates'? Commented Jul 17, 2015 at 19:38
  • Well, I just reused the same name that the OP of the other thread had used :) Commented Jul 17, 2015 at 19:40
  • And new_array isn't an array, it's a dict -- OP of the other thread is bad at naming things :/ Commented Jul 17, 2015 at 19:40
  • sum(min(2, count) for count in collections.Counter(nums).values()) Commented Jul 17, 2015 at 19:49
  • @StevenRumbalski it's still coming in slower than the original method - In [19]: %timeit removeDuplicates(l) 1000 loops, best of 3: 373 µs per loop In [20]: %timeit rD2(l) 1000 loops, best of 3: 385 µs per loop. Where rD2() is the function I defined for your method. Commented Jul 17, 2015 at 20:02

1 Answer


If the input list is list(range(x)), meaning no duplicates, then your code is faster, but if the input list has a significant number of duplicates, then your code is slower.

I consistently got timings in this order:

collections.defaultdict - fastest
original proposal - next fastest (if duplicates)
your single loop proposal - slower, if there are duplicates
collections.Counter - slowest

They are all basically the same thing, so they were always close in time.

defaultdict is the fastest because the original proposal basically duplicates it, but defaultdict is part of the core libraries that ship with Python. I guess "don't reinvent the wheel" applies.
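For reference, a defaultdict version could look roughly like this (a sketch, not necessarily the exact code I timed; the function name is illustrative):

from collections import defaultdict

def removeDuplicatesDefaultdict(nums):
    if not nums:
        return 0
    # defaultdict(int) yields 0 for missing keys, so no .get() call is needed.
    counts = defaultdict(int)
    for num in nums:
        counts[num] += 1
    # Cap each count at 2 and sum, like the second loop in the original proposal.
    return sum(min(2, count) for count in counts.values())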

But why is your code slower when it uses a single loop? Consider that the original code does two loops because there are two different things to iterate over: it iterates over the original data list once, then iterates over the unique items (which may be far fewer, because duplicates are expected).

Your code does everything the original code does, but it does it for every element in the original data list. Think of it like two separate loops with one loop counter for both. You still perform the first loop for all elements in the original list, as you must. But the second loop (which you try to get rid of by performing it inside the original loop) now must execute its code for every item in the original data set.

What you gained from having one loop you lost by executing its body more often, specifically once for every duplicate in the original data.
