I have an array with 35k elements. How can I efficiently find the duplicate elements and return them?
all = [*1..35000, 1]
This solution works:
all.select { |v| all.count(v) > 1 }
But it takes a long time.
Your code is taking an eon to execute because it calls count once for every element, and each call to count scans the entire array, giving the whole operation a computational complexity of O(n²).
arr = [*1..35000, 1, 34999]
If you want to know which values appear in the array at least twice...
require 'set'
uniq_set = Set.new
arr.each_with_object(Set.new) { |x,dup_set| uniq_set.add?(x) || dup_set.add(x) }.to_a
#=> [1, 34999]
Set lookups (implemented with a hash under the covers) are extremely fast.
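The trick is that Set#add? returns the set when the element was newly added and nil when it was already present, which is what routes repeats into dup_set:
require 'set'
s = Set.new
s.add?(1) #=> #<Set: {1}> (1 was new, so add? returns the set)
s.add?(1) #=> nil (1 is already present)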
If you want to know how many times each value that appears at least twice occurs in the array...
arr.each_with_object(Hash.new(0)) { |x,h| h[x] += 1 }.select { |_,v| v > 1 }
#=> {1=>2, 34999=>2}
This uses a counting hash [1]. See Hash::new for the form that takes a default value as an argument.
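For instance, a hash built with Hash.new(0) returns 0 for missing keys, so h[x] += 1 works even on the first occurrence of a value:
counts = Hash.new(0) # missing keys read as the default, 0
counts[:a] #=> 0 (key :a is still absent)
counts[:a] += 1 # same as counts[:a] = counts[:a] + 1
counts #=> {:a=>1}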
If you want to know the indices of values that appear in the array at least twice...
arr.each_with_index.
with_object({}) { |(x,i),h| (h[x] ||= []) << i }.
select { |_,v| v.size > 1 }
#=> {1=>[0, 35000], 34999=>[34998, 35001]}
When the hash h does not already have a key x,
(h[x] ||= []) << i
#=> (h[x] = h[x] || []) << i
#=> (h[x] = nil || []) << i
#=> (h[x] = []) << i
#=> [] << i where [] is now h[x]
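Concretely, running those two pushes for the value 1 at indices 0 and 35000:
h = {}
(h[1] ||= []) << 0 # h[1] was nil, so it becomes [] and 0 is pushed
(h[1] ||= []) << 35000 # h[1] already exists, so 35000 is appended
h #=> {1=>[0, 35000]}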
[1] Ruby v2.7 gave us the method Enumerable#tally, allowing us to write arr.tally.select { |_,v| v > 1 }.
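For example, with the same arr:
arr.tally.select { |_,v| v > 1 }
#=> {1=>2, 34999=>2}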
If you're on Ruby 2.4.0+, you can use group_by + Hash#transform_values:
all.group_by(&:itself).transform_values(&:size).select { |_, freq| freq > 1 }
See it in action:
all.group_by(&:itself) groups equal elements together, so each value maps to an array of its occurrences:
{
1 => [1, 1],
2 => [2],
...
35000 => [35000]
}
Then turn the values of the above hash into frequencies:
all.group_by(&:itself).transform_values(&:size):
{
1 => 2,
2 => 1,
...
35000 => 1
}
Benchmark:
def measure_execution_time
started = Time.now
yield
Time.now - started
end
measure_execution_time do
all.select { |value| all.count(value) > 1 }
end
=> 7.235489
measure_execution_time do
all.group_by(&:itself).transform_values(&:size).select { |_, freq| freq > 1 }
end
=> 0.017887
Use select { |_, freq| freq > 1 }.keys to return an array again, instead of a hash.
Cary's solution seems to be the fastest here so far. Here is mine:
large_array.sort.each_cons(2).with_object(Set.new) do |(e1, e2), acc|
acc << e1 if e1 == e2
end.to_a
NB: unlike Cary's solution, and like juanitofatas's, this is easily adapted to find values having more than N occurrences (just change the parameter to each_cons; see the sketch below).
Also, if the original array may be mutated, this consumes the least memory of the solutions here (replace sort with sort!).
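For example, a minimal sketch of that adaptation (the method name and the group.uniq.size test are mine):
require 'set'
# In a sorted array, a value occurring at least n times must fill n
# consecutive slots, so each_cons(n) sees all n copies as one group.
def at_least_n_occurrences(arr, n)
  arr.sort.each_cons(n).with_object(Set.new) do |group, acc|
    acc << group.first if group.uniq.size == 1 # all n elements equal
  end.to_a
end
at_least_n_occurrences([1, 2, 2, 3, 3, 3], 3) #=> [3]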
Benchmarks:
require 'set'
require 'benchmark'
def mudsie arr
arr.sort.each_cons(2).each_with_object(Set.new) do |(e1, e2), acc|
acc << e1 if e1 == e2
end.to_a
end
def cary arr
uniq_set = Set.new
arr.each_with_object(Set.new) do |x,dup_set|
uniq_set.add?(x) || dup_set.add(x)
end.to_a
end
def juanitofatas arr
arr.group_by(&:itself).transform_values(&:size).select do |_, freq|
freq > 1
end.keys
end
arr = [*(1..35000)]
arr = (arr + arr.sample(500)).shuffle
n = 500
Benchmark.bm do |x|
x.report("juanitofatas") { n.times { juanitofatas arr } }
x.report("cary") { n.times { cary arr } }
x.report("mudsie") { n.times { mudsie arr } }
end
user system total real
juanitofatas 4.321030 0.000000 4.321030 ( 4.321232)
cary 3.229409 0.032003 3.261412 ( 3.261995)
mudsie 3.798093 0.000000 3.798093 ( 3.798141)
Great replies!
My version is inspired by Cary Swoveland's but more verbose. It is essentially Nondv's minus one line, using select instead of select!, which seems even faster:
def igian(arr)
h = Hash.new(0)
arr.each { |a| h[a] += 1 }
h.select { |_k, v| v > 1 }
end
Benchmark of all replies, inspired by mudasobwa.
n = 500
user system total real
cary1 5.040000 0.200000 5.240000 ( 5.248103)
cary2 4.700000 0.190000 4.890000 ( 4.911883)
juanito 7.430000 0.030000 7.460000 ( 7.483123)
mudsie 5.430000 0.020000 5.450000 ( 5.460839)
nondv1 4.720000 0.190000 4.910000 ( 4.924792)
nondv2 5.110000 0.190000 5.300000 ( 5.317148)
igian 4.310000 0.190000 4.500000 ( 4.522211)
n = 1000
user system total real
cary1 10.460000 0.410000 10.870000 ( 10.900927)
cary2 9.550000 0.410000 9.960000 ( 9.989021)
juanito 15.370000 0.160000 15.530000 ( 15.569288)
mudsie 10.920000 0.020000 10.940000 ( 10.972357)
nondv1 9.590000 0.410000 10.000000 ( 10.017669)
nondv2 10.340000 0.410000 10.750000 ( 10.774538)
igian 8.790000 0.400000 9.190000 ( 9.213292)
Here is the benchmark code:
require 'benchmark'
require 'set'
arr = [*1..35000, 1]
def cary1(arr)
uniq_set = Set.new
arr.each_with_object(Set.new) { |x,dup_set| uniq_set.add?(x) || dup_set.add(x) }.to_a
end
def cary2(arr)
arr.each_with_object(Hash.new(0)) { |x,h| h[x] += 1 }.select { |_,v| v > 1 }
end
def juanito(arr)
arr.group_by(&:itself).transform_values(&:size).select { |_,v| v > 1 }
end
def mudsie(arr)
arr.sort.each_cons(2).each_with_object(Set.new) do |(e1, e2), acc|
acc << e1 if e1 == e2
end.to_a
end
def nondv1(arr)
counter = Hash.new(0)
arr.each { |e| counter[e] += 1 }
counter.select! { |k, v| v > 1 }
counter.keys
end
def nondv2(arr)
arr.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
.select! { |k, v| v > 1 }
.keys
end
def igian(arr)
h = Hash.new(0)
arr.each { |a| h[a] += 1 }
h.select { |_k, v| v > 1 }
end
n = 500 #1000
Benchmark.bm do |x|
x.report("cary1") { n.times { cary1 arr } }
x.report("cary2") { n.times { cary2 arr } }
x.report("juanito") { n.times { juanito arr } }
x.report("mudsie") { n.times { mudsie arr } }
x.report("nondv1") { n.times { nondv1 arr } }
x.report("nondv2") { n.times { nondv2 arr } }
x.report("igian") { n.times { igian arr } }
end
Note: select is faster than select!. With many duplicates (e.g. arr = [*1..35000, *1..10000]), nondv1 was the top performer for me, and that had little to do with the .keys conversion.
Can't you just count them yourself?
counter = Hash.new(0)
array.each { |e| counter[e] += 1 }
counter.select! { |k, v| v > 1 }
counter.keys
# OR
array.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
.select! { |k, v| v > 1 }
.keys
This has O(n) complexity, and I don't think you can go faster than that.