
I have an array with 35k elements. How can I efficiently find the duplicates and return them?

all = [*1..35000, 1]

This solution works:

all.select { |v| all.count(v) > 1 }

But it takes a long time.

1 Comment

"This solution works: [...] But it takes a long time." – did you see the comments / other answers on that page?

5 Answers


Your code is taking an eon to execute because it calls count for each element, and each call scans the entire array, giving it a computational complexity of O(n²) — for 35,001 elements that is over a billion comparisons.

arr = [*1..35000, 1, 34999]

If you want to know which values appear in the array at least twice...

require 'set'

uniq_set = Set.new
arr.each_with_object(Set.new) { |x,dup_set| uniq_set.add?(x) || dup_set.add(x) }.to_a
  #=> [1, 34999]

Set lookups (implemented with a hash under the covers) are extremely fast.

See Set#add? and Set#add.
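As a quick illustration of the add? behavior this relies on (a toy example, not from the original answer): add? returns the set when the element is new and nil when it is already present.

require 'set'

s = Set.new
s.add?(1) #=> #<Set: {1}> (1 was newly added, so the result is truthy)
s.add?(1) #=> nil (1 was already present, so || falls through to dup_set.add)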

If you want to know how many times each value appearing at least twice occurs in the array...

arr.each_with_object(Hash.new(0)) { |x,h| h[x] += 1 }.select { |_,v| v > 1 }
  #=> {1=>2, 34999=>2}

This uses a counting hash¹. See Hash::new, specifically the form that takes a default value as an argument.
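A quick illustration of that default value (a toy example, not from the original answer):

h = Hash.new(0)   # 0 is returned for missing keys
h[:a]             #=> 0 (default returned; :a is not stored)
h[:a] += 1        #=> 1 (expands to h[:a] = h[:a] + 1, i.e. 0 + 1)
h                 #=> {:a=>1}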

If you want to know the indices of values that appear in the array at least twice...

arr.each_with_index.
    with_object({}) { |(x,i),h| (h[x] ||= []) << i }.
    select { |_,v| v.size > 1 }
  #=> {1=>[0, 35000], 34999=>[34998, 35001]}

When the hash h does not already have a key x,

(h[x] ||= []) << i
   #=> (h[x] = h[x] || []) << i
   #=> (h[x] = nil || []) << i
   #=> (h[x] = []) << i
   #=> [] << i where [] is now h[x]

On later encounters of x, h[x] is already an array, so ||= leaves it untouched and i is simply appended.

¹ Ruby v2.7 gave us the method Enumerable#tally, allowing us to write arr.tally.select { |_,v| v > 1 }.
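For example (assuming Ruby 2.7+):

[1, 1, 2].tally                          #=> {1=>2, 2=>1}
[1, 1, 2].tally.select { |_,v| v > 1 }   #=> {1=>2}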


1 Comment

I was writing something similar. When it comes to uniqueness of elements in a large set, just use Set.

If you're on Ruby 2.4.0+, you can use group_by + Hash#transform_values:

all.group_by(&:itself).transform_values(&:size).select { |_, freq| freq > 1 }

See it in action:

all.group_by(&:itself) groups the occurrences of each element together:

{
  1 => [1, 1],
  2 => [2],

  ...

  35000 => [35000]
}

Then transform_values turns the values of the above hash into frequencies:

all.group_by(&:itself).transform_values(&:size):

{
  1 => 2,
  2 => 1,

  ...

  35000 => 1
}
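Finally, select keeps only the entries whose frequency is greater than 1; for the example array that leaves just the duplicated value:

all.group_by(&:itself).transform_values(&:size).select { |_, freq| freq > 1 }
  #=> {1=>2}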

Benchmark:

def measure_execution_time
  started = Time.now
  yield
  Time.now - started
end

measure_execution_time do
  all.select { |value| all.count(value) > 1 }
end
=> 7.235489


measure_execution_time do
  all.group_by(&:itself).transform_values(&:size).select { |_, freq| freq > 1 }
end
=> 0.017887

2 Comments

Damn, that's clever. Minor improvement to match the return value would be to select { |_, freq| freq > 1 }.keys to return an array again, instead of a hash.
The goal here is [1] in the output array, so like Brian says, just one step to go and this does it all.

Cary’s solution seems to be the fastest here so far. Here is mine:

large_array.sort.each_cons(2).with_object(Set.new) do |(e1, e2), acc|
  acc << e1 if e1 == e2
end.to_a

NB: unlike Cary's solution, and like juanitofatas's, this is easily adapted to find values having more than N occurrences (just change the parameter to each_cons, as sketched below).
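For instance, a minimal sketch of that adaptation (my illustration, not part of the original answer) to find values occurring at least three times:

large_array.sort.each_cons(3).with_object(Set.new) do |(e1, e2, e3), acc|
  acc << e1 if e1 == e2 && e2 == e3
end.to_a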

Also, if the original array may be mutated, this consumes the least memory of all the solutions here (change sort to sort!).


Benchmarks:

require 'set'
require 'benchmark'

def mudsie arr 
  arr.sort.each_cons(2).each_with_object(Set.new) do |(e1, e2), acc|
    acc << e1 if e1 == e2
  end.to_a
end

def cary arr
  uniq_set = Set.new
  arr.each_with_object(Set.new) do |x,dup_set|
    uniq_set.add?(x) || dup_set.add(x)
  end.to_a
end

def juanitofatas arr 
  arr.group_by(&:itself).transform_values(&:size).select do |_, freq|
    freq > 1
  end.keys
end

arr = [*(1..35000)]
arr = (arr + arr.sample(500)).shuffle

n = 500

Benchmark.bm do |x| 
  x.report("juanitofatas") { n.times { juanitofatas arr } } 
  x.report("cary") { n.times { cary arr } } 
  x.report("mudsie") { n.times { mudsie arr } } 
end

        user     system      total        real
juanitofatas   4.321030   0.000000    4.321030 (  4.321232)
        cary   3.229409   0.032003    3.261412 (  3.261995)
      mudsie   3.798093   0.000000    3.798093 (  3.798141)

1 Comment

Thanks for the benchmark, mudsie. They're all pretty close, so close that any might rise to the top with different parameters.

Great replies!

My version is inspired by Cary Swoveland's but more verbose; it is just one line shorter than Nondv's, using select instead of select!, which seems even faster:

def igian(arr)
  h = Hash.new(0)
  arr.each { |a| h[a] += 1 }
  h.select { |_k, v| v > 1 }
end

A benchmark of all the replies, inspired by mudasobwa's:

n = 500
         user     system      total        real
cary1    5.040000   0.200000   5.240000 (  5.248103)
cary2    4.700000   0.190000   4.890000 (  4.911883)
juanito  7.430000   0.030000   7.460000 (  7.483123)
mudsie   5.430000   0.020000   5.450000 (  5.460839)
nondv1   4.720000   0.190000   4.910000 (  4.924792)
nondv2   5.110000   0.190000   5.300000 (  5.317148)
igian    4.310000   0.190000   4.500000 (  4.522211)

n = 1000
         user     system      total        real
cary1   10.460000   0.410000  10.870000 ( 10.900927)
cary2    9.550000   0.410000   9.960000 (  9.989021)
juanito 15.370000   0.160000  15.530000 ( 15.569288)
mudsie  10.920000   0.020000  10.940000 ( 10.972357)
nondv1   9.590000   0.410000  10.000000 ( 10.017669)
nondv2  10.340000   0.410000  10.750000 ( 10.774538)
igian    8.790000   0.400000   9.190000 (  9.213292)

Here is the benchmark code:

require 'benchmark'
require 'set'

arr = [*1..35000, 1]

def cary1(arr)
  uniq_set = Set.new
  arr.each_with_object(Set.new) { |x,dup_set| uniq_set.add?(x) || dup_set.add(x) }.to_a
end

def cary2(arr)
  arr.each_with_object(Hash.new(0)) { |x,h| h[x] += 1 }.select { |_,v| v > 1 }
end

def juanito(arr)
  arr.group_by(&:itself).transform_values(&:size).select { |_,v| v > 1 }
end

def mudsie(arr)
  arr.sort.each_cons(2).each_with_object(Set.new) do |(e1, e2), acc|
    acc << e1 if e1 == e2
  end.to_a
end

def nondv1(arr)
  counter = Hash.new(0)
  arr.each { |e| counter[e] += 1 }
  counter.select! { |k, v| v > 1 }
  counter.keys
end

def nondv2(arr)
  arr.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
     .select! { |k, v| v > 1 }
     .keys
end

def igian(arr)
  h = Hash.new(0)
  arr.each { |a| h[a] += 1 }
  h.select { |_k, v| v > 1 }
end

n = 500 #1000
Benchmark.bm do |x|
  x.report("cary1") { n.times { cary1 arr } }
  x.report("cary2") { n.times { cary2 arr } }
  x.report("juanito") { n.times { juanito arr } }
  x.report("mudsie") { n.times { mudsie arr } }
  x.report("nondv1") { n.times { nondv1 arr } }
  x.report("nondv2") { n.times { nondv2 arr } }
  x.report("igian") { n.times { igian arr } }
end

2 Comments

It’s funny that select is faster than select!.
Some of these methods output a list of values that occur in duplicate, while some output a hash with the values as keys and a count. I added .keys to those to compare apples to apples and got similar benchmarks, but there's only one dupe in these tests. So I changed the array to arr = [*1..35000, *1..10000] Interestingly, nondv1 was the top performer for me in the case of many duplicates, and it had little to do with the .keys conversion.

Can't you just count them yourself?

counter = Hash.new(0)
array.each { |e| counter[e] += 1 }
counter.select! { |k, v| v > 1 }
counter.keys

# OR

array.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
     .select! { |k, v| v > 1 }
     .keys

It has O(n) complexity, and I don't think you can go faster than that.

1 Comment

Typo here: counter.select! { |k, v| > 1 } ;)
