
I have the following arrays:

A = "cheddar".split(//)  # ["c", "h", "e", "d", "d", "a", "r"]
B = "cheddaar".split(//) # ["c", "h", "e", "d", "d", "a", "a", "r"]

The A array is a subset of the B array. If the A array had another "d" element, it wouldn't be a subset.

I want to compare the two and find out whether one is a subset of the other even when they contain duplicates. Neither A - B nor A & B captures the duplicates; they only check membership, so both report the arrays as matching.
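
A quick irb check on the arrays above shows why (my demonstration, using only the built-in Array operators):

A - B  #=> [] (empty even if A had an extra "d", since Array#- removes all occurrences)
A & B  #=> ["c", "h", "e", "d", "a", "r"] (the intersection collapses duplicates)

So I wrote the following, which does capture the duplicates: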

B.each do |letter|
  A.delete_at(A.index(letter)) rescue ""  # Array#index returns nil when letter isn't in A;
                                          # delete_at(nil) raises TypeError, which the rescue swallows
end

p A.empty?

Is this the best way or can it be optimized?

  • What do you mean by duplicates? Can you give an example? This will help in determining why A - B wouldn't work. Commented Jul 5, 2012 at 17:49
  • I mean having the same element more than once. For example, with the two arrays A = ["a","b","c","d"] and B = ["a", "a", "b", "c", "d"], the commands A - B and A & B can't determine the difference between them. With the code I present, the A array will be left holding the duplicate element at the end, so you can tell the arrays aren't the same. Commented Jul 5, 2012 at 17:55
  • I find this statement conflicts with your comment: If the A array had another "d" element, it wouldn't be a subset. In your question, can you give examples of strings that would be considered subsets and ones that would not? Commented Jul 5, 2012 at 17:57
  • Basically I want to find out whether all of the elements in the A array exist in the B array. And if an element exists twice in the A array, it must exist two or more times in the B array. Commented Jul 5, 2012 at 18:03
  • I think treating each element as unique within its array will give better clarity. Also, can B be a subset of A? @JohnDel Commented Jul 5, 2012 at 18:04

7 Answers


No idea if this is actually faster than your approach, but its runtime should be O(N+M), where N and M are the sizes of a and b. (This assumes hash lookup and insert are amortized O(1), which isn't strictly true since hashing is usually a function of key size; in the example, though, all keys are single characters.) Your looping #delete_at of #index approach has substantial extra data motion and looks like it may be O(N*M) in the worst case.

def ary_subset?(a,b) # true iff a is subset of b
  a_counts = a.reduce(Hash.new(0)) { |m,v| m[v] += 1; m }
  b_counts = b.reduce(Hash.new(0)) { |m,v| m[v] += 1; m }
  a_counts.all? { |a_key,a_ct| a_ct <= b_counts[a_key] }
end
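
A quick sanity check on the question's arrays (my addition, using ary_subset? as defined above):

a = "cheddar".split(//)
b = "cheddaar".split(//)
ary_subset?(a, b)  #=> true
ary_subset?(b, a)  #=> false ("a" occurs twice in b but only once in a)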

The OP asked for the fastest way, so I whipped up a little micro-benchmark available at this gist.

I tested the OP's original approach (op_del), my version using reduce counts (ct), the variant where the count hash is reused (ct_acc), and the Multiset approach (mset); EDIT: also added the very concise find over count comparisons (slow_ct). I ran each variant against the OP's small array example (s), larger arrays of cardinality 10,000 (b), and the small array against a big one (sb). (I had to reduce the iteration count for the big-array cases by an order of magnitude to get slow_ct to complete in reasonable time.) Results here:

                     user     system      total        real
s_op_del         1.850000   0.000000   1.850000 (  1.853931)
s_ct             2.260000   0.000000   2.260000 (  2.264028)
s_ct_acc         1.700000   0.000000   1.700000 (  1.706881)
s_mset           5.460000   0.000000   5.460000 (  5.484833)
s_slow_ct        1.720000   0.000000   1.720000 (  1.731367)
b_op_del         0.310000   0.000000   0.310000 (  0.312804)
b_ct             0.120000   0.000000   0.120000 (  0.123329)
b_ct_acc         0.100000   0.000000   0.100000 (  0.101532)
b_mset           0.310000   0.000000   0.310000 (  0.319697)
b_slow_ct       82.910000   0.000000  82.910000 ( 83.013747)
sb_op_del        0.710000   0.020000   0.730000 (  0.734022)
sb_ct            0.050000   0.000000   0.050000 (  0.054416)
sb_ct_acc        0.040000   0.000000   0.040000 (  0.059032)
sb_mset          0.110000   0.000000   0.110000 (  0.117027)
sb_slow_ct       0.010000   0.000000   0.010000 (  0.011287)

The reduce-count variant that reuses the count accumulator is the clear winner. Multiset was disappointingly slow.


4 Comments

If a_key isn't in b, then the <= comparison will give an error
@danieltahara The count hashes were initialized with Hash.new(0) so lookup of non-existent keys will return 0.
Can you post a gist of your benchmark code? Mine found Multiset to be speedy.
@dbenhur -- totally missed that, good point. From a more granular/slightly off-topic perspective, what's the cost of initializing a Hash with a default value other than nil?

If I understand the requirements correctly, you can use the multiset gem.

require 'multiset'
a = Multiset.new "cheddar".split(//)
b = Multiset.new "cheddaar".split(//)

a.subset? b #=> true
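
The reverse check should then fail, since b has an extra "a" (my addition, assuming the gem's Multiset#subset? semantics):

b.subset? a #=> false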

3 Comments

Set loses the cardinality of its members. "cheddar".split(//).to_set #=> #<Set: {"c", "h", "e", "d", "a", "r"}>
Note that multiset is a gem, not part of the Ruby stdlib.
Hmm, that's true. Upon re-reading, cardinality is required by the OP. I updated it to the multiset gem.

If I remember correctly, your solution is O(n^2). This one is a bit cumbersome, but more efficient, at least for large inputs (it is O(n)). It might need some more work...

def is_subset?(a, b)  # true iff the characters of a form a sub-multiset of the characters of b
  letters = Hash.new(0)
  b.each_char { |x| letters[x] += 1 }  # count b's characters, then subtract a's
  a.each_char { |x| letters[x] -= 1 }
  letters.values.all? { |v| v >= 0 }   # negative means a needs more of that character than b has
end

Edit: a bit more efficient:

def is_subset?(a, b)
  letters = Hash.new(0)
  b.each_char { |x| letters[x] += 1 }
  a.each_char.all? { |x| (letters[x] -= 1) >= 0 }  # >= 0, not > 0: stops early once a uses more of a character than b has
end
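
A quick check against the question's strings (my addition, using the definition above):

is_subset?("cheddar", "cheddaar")  #=> true
is_subset?("chedddar", "cheddaar") #=> false (needs three "d"s, only two available)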

2 Comments

This is substantially similar to my answer. Nice insight to reuse the counts hash. Would be cleaner to use each_char instead of split.each, reduce instead of init+each, and your final inject is essentially Enumerable#all? without the benefit of terminating early on first false block eval.
Thanks, good comments. Keep them coming; I guess that with some thought, this can become much more elegant :)

You definitely want to take advantage of enumerators here -- the best way to go about this is probably to use group_by and compare the number of times each letter appears:

def subset?(a, b)
  a = a.each_char.group_by { |char| char }
  b = b.each_char.group_by { |char| char }
  a.each_key.all? do |letter|
    b[letter] && a[letter].size <= b[letter].size  # <=, not <: equal counts still qualify as a subset
  end
end
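
For instance (my addition, using the question's strings):

subset?("cheddar", "cheddaar")  #=> true
subset?("chedddar", "cheddaar") #=> false (three "d"s in a, only two in b)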

So if we count hash lookups as O(1) operations, this is an O(m + n) algorithm.

Comments


A similar question was posted a few weeks ago and I got the accepted answer with something like:

def is_subset?(a,b)
  # true iff a is a sub-multiset of b: no element occurs more often in a than in b
  !a.find{|x| a.count(x) > b.count(x)}
end
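
A quick check on the question's arrays (my addition):

is_subset?("cheddar".split(//), "cheddaar".split(//))  #=> true
is_subset?("chedddar".split(//), "cheddaar".split(//)) #=> false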

Benchmark Update

require 'benchmark'

def random_char
  ('a'..'z').to_a.sample
end

A = 8.times.map{random_char}
B = 8.times.map{random_char}

def ary_subset?(a,b) # true iff a is subset of b
  a_counts = a.reduce(Hash.new(0)) { |m,v| m[v] += 1; m }
  b_counts = b.reduce(Hash.new(0)) { |m,v| m[v] += 1; m }
  a_counts.all? { |a_key,a_ct| a_ct <= b_counts[a_key] }
end

Benchmark.bm do |x|
  x.report('me')     {100000.times{is_subset?(A,B)}}
  x.report('dbenhur'){100000.times{ary_subset?(A,B)}}
end

              user     system      total        real
me        0.375000   0.000000   0.375000 (  0.384022)
dbenhur   2.558000   0.000000   2.558000 (  2.550146)

6 Comments

This is very nicely written, but not very efficient. I think this is O(N*(N+M))
It doesn't matter. It wins by skipping the overhead of creating multiple hashes (or multisets). Especially when there's a hit early. When there's no hit it still performs about as well as the rest of these.
No it doesn't win by skipping "overhead"; benchmarks show it at rough parity for small sets and orders of magnitude slower for large sets.
@dbenhur - your benchmark is flawed because it doesn't use strings. I'm assuming OP wants to optimize for strings and not (0...10000).to_a ;)
Your benchmark is quite flawed for many cases. You make sets by picking 8 elements from a range of 26, so your subset test will fail very early, often on the first member test. String comparison vs fixnum comparison isn't a big difference for short strings, but comparing dramatically unlike small sets has very different performance than comparing substantially similar large sets.

Try this:

class String
  # NB: this walks both strings in order, so it assumes matching character
  # order; the OP later clarified that order should not matter (see Edit 1).
  def subset_of?(str)
    e2 = str.each_char
    c2 = c2p = nil
    each_char do |c1|
      c2p, c2 = c2, e2.next
      next if c2 == c1
      c2p, c2 = c2, e2.next until (c2 != c2p) # advance past a run of duplicates in str
      return false if c2 != c1
    end
    true
  rescue StopIteration
    false
  end
end

Test the function:

>> "chedddar".subset_of?("cheddaaaaaar")
=> false
>> "cheddar".subset_of?("cheddaaaaaar")
=> true
>> "cheddar".subset_of?("cheddaaaaaarkkkk")
=> true
>> "chedddar".subset_of?("cheddar")
=> false
>> "chedddar".subset_of?("chedd")
=> false

Edit 1

Updated the solution based on the additional information provided.

class String
  def subset_of?(str)
    h1, h2 = [self, str].map {|s| s.each_char.reduce(Hash.new(0)){|h, c| h[c] += 1; h}}
    h1.all?{|c, k| h2[c] >= k}
  end
end
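
With this version the ordering no longer matters, which matches the requirement added in the comments (my addition):

"cheddar".subset_of?("cheaddr")   #=> true
"chedddar".subset_of?("cheddaar") #=> false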

10 Comments

Very nice approach, but I also want p "cheddar".subset_of?("cheaddr") to return true, i.e. I don't want to take the ordering of the string / array into account. My main concern is that the response time be as fast as possible, ideally faster than my approach.
Updated my answer. Take a look.
Updated again with a slightly faster version.
@KandadaBoggu your revised (Edit1) answer is similar to my approach, but you will do N passes over str for O(N*M) cost where you could get O(N+M) cost by accumulating the counts for str in one #reduce.
@KandadaBoggu Why would Enumerable#all? need Rails to avoid error?

How about removing the dupes first?

(A.uniq - B.uniq).empty?
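
Note that uniq throws away exactly the duplicate counts the question cares about, so this can give false positives (my addition as a quick check):

a = "chedddar".split(//)  # three "d"s; not a subset of b by the question's rule
b = "cheddaar".split(//)
(a.uniq - b.uniq).empty?  #=> true, even though a isn't a subset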

Comments
