
I have an array with 35k elements. How can I efficiently find the duplicates and return them?

all = [*1..35000, 1]

This solution works:

all.select { |v| all.count(v) > 1 }

But it takes a long time.

1 Comment

"This solution works: [...] But it takes a long time." – did you see the comments / other answers on that page?

5 Answers


Your code is taking an eon to execute because it calls count for each element, and each call scans the entire array, giving it a computational complexity of O(n²) — for 35,001 elements that is over a billion comparisons.

arr = [*1..35000, 1, 34999]

If you want to know which values appear in the array at least twice...

require 'set'

uniq_set = Set.new
arr.each_with_object(Set.new) { |x,dup_set| uniq_set.add?(x) || dup_set.add(x) }.to_a
  #=> [1, 34999]

Set lookups (implemented with a hash under the covers) are extremely fast.

See Set#add? and Set#add.
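As a quick illustration of the add? behavior this relies on (a toy example, not from the original answer): add? returns the set when the element is new and nil when it is already present.

require 'set'

s = Set.new
s.add?(1) #=> #<Set: {1}> (1 was newly added, so the result is truthy)
s.add?(1) #=> nil (1 was already present, so || falls through to dup_set.add)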

If you want to know how many times each value appearing at least twice occurs in the array...

arr.each_with_object(Hash.new(0)) { |x,h| h[x] += 1 }.select { |_,v| v > 1 }
  #=> {1=>2, 34999=>2}

This uses a counting hash¹. See Hash::new, specifically the form that takes a default value as an argument.
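A quick illustration of that default value (a toy example, not from the original answer):

h = Hash.new(0)   # 0 is returned for missing keys
h[:a]             #=> 0 (default returned; :a is not stored)
h[:a] += 1        #=> 1 (expands to h[:a] = h[:a] + 1, i.e. 0 + 1)
h                 #=> {:a=>1}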

If you want to know the indices of values that appear in the array at least twice...

arr.each_with_index.
    with_object({}) { |(x,i),h| (h[x] ||= []) << i }.
    select { |_,v| v.size > 1 }
  #=> {1=>[0, 35000], 34999=>[34998, 35001]}

When the hash h does not already have a key x,

(h[x] ||= []) << i
   #=> (h[x] = h[x] || []) << i
   #=> (h[x] = nil || []) << i
   #=> (h[x] = []) << i
   #=> [] << i where [] is now h[x]

On later encounters of x, h[x] is already an array, so ||= leaves it untouched and i is simply appended.

¹ Ruby v2.7 gave us the method Enumerable#tally, allowing us to write arr.tally.select { |_,v| v > 1 }.
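For example (assuming Ruby 2.7+):

[1, 1, 2].tally                          #=> {1=>2, 2=>1}
[1, 1, 2].tally.select { |_,v| v > 1 }   #=> {1=>2}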


1 Comment

I was writing something similar. When it comes to uniqueness of elements in a large set, just use Set.

If you're on Ruby 2.4.0+, you can use group_by + Hash#transform_values:

all.group_by(&:itself).transform_values(&:size).select { |_, freq| freq > 1 }

See it in action:

all.group_by(&:itself) groups the occurrences of each element together:

{
  1 => [1, 1],
  2 => [2],

  ...

  35000 => [35000]
}

Then transform_values turns the values of the above hash into frequencies:

all.group_by(&:itself).transform_values(&:size):

{
  1 => 2,
  2 => 1,

  ...

  35000 => 1
}
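Finally, select keeps only the entries whose frequency is greater than 1; for the example array that leaves just the duplicated value:

all.group_by(&:itself).transform_values(&:size).select { |_, freq| freq > 1 }
  #=> {1=>2}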

Benchmark:

def measure_execution_time
  started = Time.now
  yield
  Time.now - started
end

measure_execution_time do
  all.select { |value| all.count(value) > 1 }
end
=> 7.235489


measure_execution_time do
  all.group_by(&:itself).transform_values(&:size).select { |_, freq| freq > 1 }
end
=> 0.017887

2 Comments

Damn, that's clever. Minor improvement to match the return value would be to select { |_, freq| freq > 1 }.keys to return an array again, instead of a hash.
The goal here is [1] in the output array, so like Brian says, just one step to go and this does it all.

Cary’s solution seems to be the fastest here so far. Here is mine:

large_array.sort.each_cons(2).with_object(Set.new) do |(e1, e2), acc|
  acc << e1 if e1 == e2
end.to_a

NB: unlike Cary's solution, and like juanitofatas's, this is easily adapted to find values having more than N occurrences (just change the parameter to each_cons, as sketched below).
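For instance, a minimal sketch of that adaptation (my illustration, not part of the original answer) to find values occurring at least three times:

large_array.sort.each_cons(3).with_object(Set.new) do |(e1, e2, e3), acc|
  acc << e1 if e1 == e2 && e2 == e3
end.to_a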

Also, if the original array may be mutated, this consumes the least memory of all the solutions here (change sort to sort!).


Benchmarks:

require 'set'
require 'benchmark'

def mudsie arr 
  arr.sort.each_cons(2).each_with_object(Set.new) do |(e1, e2), acc|
    acc << e1 if e1 == e2
  end.to_a
end

def cary arr
  uniq_set = Set.new
  arr.each_with_object(Set.new) do |x,dup_set|
    uniq_set.add?(x) || dup_set.add(x)
  end.to_a
end

def juanitofatas arr 
  arr.group_by(&:itself).transform_values(&:size).select do |_, freq|
    freq > 1
  end.keys
end

arr = [*(1..35000)]
arr = (arr + arr.sample(500)).shuffle

n = 500

Benchmark.bm do |x| 
  x.report("juanitofatas") { n.times { juanitofatas arr } } 
  x.report("cary") { n.times { cary arr } } 
  x.report("mudsie") { n.times { mudsie arr } } 
end

        user     system      total        real
juanitofatas   4.321030   0.000000    4.321030 (  4.321232)
        cary   3.229409   0.032003    3.261412 (  3.261995)
      mudsie   3.798093   0.000000    3.798093 (  3.798141)

1 Comment

Thanks for the benchmark, mudsie. They're all pretty close, so close that any might rise to the top with different parameters.

Great replies!

My version is inspired by Cary Swoveland's but more verbose; it is just one line shorter than Nondv's, using select instead of select!, which seems even faster:

def igian(arr)
  h = Hash.new(0)
  arr.each { |a| h[a] += 1 }
  h.select { |_k, v| v > 1 }
end

A benchmark of all the replies, inspired by mudasobwa's:

n = 500
         user     system      total        real
cary1    5.040000   0.200000   5.240000 (  5.248103)
cary2    4.700000   0.190000   4.890000 (  4.911883)
juanito  7.430000   0.030000   7.460000 (  7.483123)
mudsie   5.430000   0.020000   5.450000 (  5.460839)
nondv1   4.720000   0.190000   4.910000 (  4.924792)
nondv2   5.110000   0.190000   5.300000 (  5.317148)
igian    4.310000   0.190000   4.500000 (  4.522211)

n = 1000
         user     system      total        real
cary1   10.460000   0.410000  10.870000 ( 10.900927)
cary2    9.550000   0.410000   9.960000 (  9.989021)
juanito 15.370000   0.160000  15.530000 ( 15.569288)
mudsie  10.920000   0.020000  10.940000 ( 10.972357)
nondv1   9.590000   0.410000  10.000000 ( 10.017669)
nondv2  10.340000   0.410000  10.750000 ( 10.774538)
igian    8.790000   0.400000   9.190000 (  9.213292)

Here is the benchmark code:

require 'benchmark'
require 'set'

arr = [*1..35000, 1]

def cary1(arr)
  uniq_set = Set.new
  arr.each_with_object(Set.new) { |x,dup_set| uniq_set.add?(x) || dup_set.add(x) }.to_a
end

def cary2(arr)
  arr.each_with_object(Hash.new(0)) { |x,h| h[x] += 1 }.select { |_,v| v > 1 }
end

def juanito(arr)
  arr.group_by(&:itself).transform_values(&:size).select { |_,v| v > 1 }
end

def mudsie(arr)
  arr.sort.each_cons(2).each_with_object(Set.new) do |(e1, e2), acc|
    acc << e1 if e1 == e2
  end.to_a
end

def nondv1(arr)
  counter = Hash.new(0)
  arr.each { |e| counter[e] += 1 }
  counter.select! { |k, v| v > 1 }
  counter.keys
end

def nondv2(arr)
  arr.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
     .select! { |k, v| v > 1 }
     .keys
end

def igian(arr)
  h = Hash.new(0)
  arr.each { |a| h[a] += 1 }
  h.select { |_k, v| v > 1 }
end

n = 500 #1000
Benchmark.bm do |x|
  x.report("cary1") { n.times { cary1 arr } }
  x.report("cary2") { n.times { cary2 arr } }
  x.report("juanito") { n.times { juanito arr } }
  x.report("mudsie") { n.times { mudsie arr } }
  x.report("nondv1") { n.times { nondv1 arr } }
  x.report("nondv2") { n.times { nondv2 arr } }
  x.report("igian") { n.times { igian arr } }
end

2 Comments

It’s funny that select is faster than select!.
Some of these methods output a list of values that occur in duplicate, while some output a hash with the values as keys and a count. I added .keys to those to compare apples to apples and got similar benchmarks, but there's only one dupe in these tests. So I changed the array to arr = [*1..35000, *1..10000] Interestingly, nondv1 was the top performer for me in the case of many duplicates, and it had little to do with the .keys conversion.

Can't you just count them yourself?

counter = Hash.new(0)
array.each { |e| counter[e] += 1 }
counter.select! { |k, v| v > 1 }
counter.keys

# OR

array.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
     .select! { |k, v| v > 1 }
     .keys

It has O(n) complexity, and I don't think you can go faster than that.

1 Comment

Typo here: counter.select! { |k, v| > 1 } ;)
