
I have the following arrays:

A = "cheddar".split(//)  # ["c", "h", "e", "d", "d", "a", "r"]
B = "cheddaar".split(//) # ["c", "h", "e", "d", "d", "a", "a", "r"]

The A array is a subset of the B array. If the A array had another "d" element, it wouldn't be a subset.

I want to compare the two and find out whether one is a subset of the other even when they contain duplicates. Neither A - B nor A & B captures the duplicates; they only check membership, so both report the arrays as matching.
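
A quick irb check on the arrays above shows why (my demonstration, using only the built-in Array operators):

A - B  #=> [] (empty even if A had an extra "d", since Array#- removes all occurrences)
A & B  #=> ["c", "h", "e", "d", "a", "r"] (the intersection collapses duplicates)

So I wrote the following, which does capture the duplicates: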

B.each do |letter|
  A.delete_at(A.index(letter)) rescue ""  # Array#index returns nil when letter isn't in A;
                                          # delete_at(nil) raises TypeError, which the rescue swallows
end

p A.empty?

Is this the best way or can it be optimized?

  • What do you mean by duplicates? Can you give an example? This will help in determining why A - B wouldn't work. Commented Jul 5, 2012 at 17:49
  • I mean having the same element more than once. For example, with the two arrays A = ["a","b","c","d"] and B = ["a", "a", "b", "c", "d"], the commands A - B and A & B can't determine the difference between them. With the code I present, the A array will be left holding the duplicate element at the end, so you can tell the arrays aren't the same. Commented Jul 5, 2012 at 17:55
  • I find this statement conflicts with your comment: If the A array had another "d" element, it wouldn't be a subset. In your question, can you give examples of strings that would be considered subsets and ones that would not? Commented Jul 5, 2012 at 17:57
  • Basically I want to find out whether all of the elements in the A array exist in the B array. And if an element exists twice in the A array, it must exist two or more times in the B array. Commented Jul 5, 2012 at 18:03
  • I think treating each element as unique within its array will give better clarity. Also, can B be a subset of A? @JohnDel Commented Jul 5, 2012 at 18:04

7 Answers


No idea if this is actually faster than your approach, but its runtime should be O(N+M), where N and M are the sizes of a and b. (This assumes hash lookup and insert are amortized O(1), which isn't strictly true since hashing is usually a function of key size; in the example, though, all keys are single characters.) Your looping #delete_at of #index approach has substantial extra data motion and looks like it may be O(N*M) in the worst case.

def ary_subset?(a,b) # true iff a is subset of b
  a_counts = a.reduce(Hash.new(0)) { |m,v| m[v] += 1; m }
  b_counts = b.reduce(Hash.new(0)) { |m,v| m[v] += 1; m }
  a_counts.all? { |a_key,a_ct| a_ct <= b_counts[a_key] }
end
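
A quick sanity check on the question's arrays (my addition, using ary_subset? as defined above):

a = "cheddar".split(//)
b = "cheddaar".split(//)
ary_subset?(a, b)  #=> true
ary_subset?(b, a)  #=> false ("a" occurs twice in b but only once in a)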

The OP asked for the fastest way, so I whipped up a little micro-benchmark available at this gist.

I tested the OP's original approach (op_del), my version using reduce counts (ct), the variant where the count hash is reused (ct_acc), and the Multiset approach (mset); EDIT: also added the very concise find over count comparisons (slow_ct). I ran each variant against the OP's small array example (s), larger arrays of cardinality 10,000 (b), and the small array against a big one (sb). (I had to reduce the iteration count for the big-array cases by an order of magnitude to get slow_ct to complete in reasonable time.) Results here:

                     user     system      total        real
s_op_del         1.850000   0.000000   1.850000 (  1.853931)
s_ct             2.260000   0.000000   2.260000 (  2.264028)
s_ct_acc         1.700000   0.000000   1.700000 (  1.706881)
s_mset           5.460000   0.000000   5.460000 (  5.484833)
s_slow_ct        1.720000   0.000000   1.720000 (  1.731367)
b_op_del         0.310000   0.000000   0.310000 (  0.312804)
b_ct             0.120000   0.000000   0.120000 (  0.123329)
b_ct_acc         0.100000   0.000000   0.100000 (  0.101532)
b_mset           0.310000   0.000000   0.310000 (  0.319697)
b_slow_ct       82.910000   0.000000  82.910000 ( 83.013747)
sb_op_del        0.710000   0.020000   0.730000 (  0.734022)
sb_ct            0.050000   0.000000   0.050000 (  0.054416)
sb_ct_acc        0.040000   0.000000   0.040000 (  0.059032)
sb_mset          0.110000   0.000000   0.110000 (  0.117027)
sb_slow_ct       0.010000   0.000000   0.010000 (  0.011287)

The reduce-count variant that reuses the count accumulator is the clear winner. Multiset was disappointingly slow.


4 Comments

If a_key isn't in b, then the <= comparison will give an error
@danieltahara The count hashes were initialized with Hash.new(0) so lookup of non-existent keys will return 0.
Can you post a gist of your benchmark code? Mine found Multiset to be speedy.
@dbenhur -- totally missed that, good point. From a more granular/slightly off-topic perspective, what's the cost of initializing a Hash with a default value other than nil?

If I understand the requirements correctly, you can use the multiset gem.

require 'multiset'
a = Multiset.new "cheddar".split(//)
b = Multiset.new "cheddaar".split(//)

a.subset? b #=> true
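
The reverse check should then fail, since b has an extra "a" (my addition, assuming the gem's Multiset#subset? semantics):

b.subset? a #=> false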

3 Comments

Set loses the cardinality of its members. "cheddar".split(//).to_set #=> #<Set: {"c", "h", "e", "d", "a", "r"}>
Note that multiset is a gem, not part of the Ruby stdlib.
Hmm, that's true. Upon re-reading, cardinality is required by the OP. I updated it to the multiset gem.

If I remember correctly, your solution is O(n^2). This one is a bit cumbersome, but more efficient, at least for large inputs (it is O(n)). It might need some more work...

def is_subset?(a, b)  # true iff the characters of a form a sub-multiset of the characters of b
  letters = Hash.new(0)
  b.each_char { |x| letters[x] += 1 }  # count b's characters, then subtract a's
  a.each_char { |x| letters[x] -= 1 }
  letters.values.all? { |v| v >= 0 }   # negative means a needs more of that character than b has
end

Edit: a bit more efficient:

def is_subset?(a, b)
  letters = Hash.new(0)
  b.each_char { |x| letters[x] += 1 }
  a.each_char.all? { |x| (letters[x] -= 1) >= 0 }  # >= 0, not > 0: stops early once a uses more of a character than b has
end
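
A quick check against the question's strings (my addition, using the definition above):

is_subset?("cheddar", "cheddaar")  #=> true
is_subset?("chedddar", "cheddaar") #=> false (needs three "d"s, only two available)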

2 Comments

This is substantially similar to my answer. Nice insight to reuse the counts hash. Would be cleaner to use each_char instead of split.each, reduce instead of init+each, and your final inject is essentially Enumerable#all? without the benefit of terminating early on first false block eval.
Thanks, good comments. Keep them coming; I guess that with some thought, this can become much more elegant :)

You definitely want to take advantage of enumerators here -- the best way to go about this is probably to use group_by and compare the number of times each letter appears:

def subset?(a, b)
  a = a.each_char.group_by { |char| char }
  b = b.each_char.group_by { |char| char }
  a.each_key.all? do |letter|
    b[letter] && a[letter].size <= b[letter].size  # <=, not <: equal counts still qualify as a subset
  end
end
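
For instance (my addition, using the question's strings):

subset?("cheddar", "cheddaar")  #=> true
subset?("chedddar", "cheddaar") #=> false (three "d"s in a, only two in b)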

So if we count hash lookups as O(1) operations, this is an O(m + n) algorithm.

Comments


A similar question was posted a few weeks ago and I got the accepted answer with something like:

def is_subset?(a,b)
  # true iff a is a sub-multiset of b: no element occurs more often in a than in b
  !a.find{|x| a.count(x) > b.count(x)}
end
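
A quick check on the question's arrays (my addition):

is_subset?("cheddar".split(//), "cheddaar".split(//))  #=> true
is_subset?("chedddar".split(//), "cheddaar".split(//)) #=> false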

Benchmark Update

require 'benchmark'

def random_char
  ('a'..'z').to_a.sample
end

A = 8.times.map{random_char}
B = 8.times.map{random_char}

def ary_subset?(a,b) # true iff a is subset of b
  a_counts = a.reduce(Hash.new(0)) { |m,v| m[v] += 1; m }
  b_counts = b.reduce(Hash.new(0)) { |m,v| m[v] += 1; m }
  a_counts.all? { |a_key,a_ct| a_ct <= b_counts[a_key] }
end

Benchmark.bm do |x|
  x.report('me')     {100000.times{is_subset?(A,B)}}
  x.report('dbenhur'){100000.times{ary_subset?(A,B)}}
end

              user     system      total        real
me        0.375000   0.000000   0.375000 (  0.384022)
dbenhur   2.558000   0.000000   2.558000 (  2.550146)

6 Comments

This is very nicely written, but not very efficient. I think this is O(N*(N+M))
It doesn't matter. It wins by skipping the overhead of creating multiple hashes (or multisets). Especially when there's a hit early. When there's no hit it still performs about as well as the rest of these.
No it doesn't win by skipping "overhead"; benchmarks show it at rough parity for small sets and orders of magnitude slower for large sets.
@dbenhur - your benchmark is flawed because it doesn't use strings. I'm assuming OP wants to optimize for strings and not (0...10000).to_a ;)
Your benchmark is quite flawed for many cases. You make sets by picking 8 elements from a range of 26, so your subset test will fail very early, often on the first member test. String comparison vs fixnum comparison isn't a big difference for short strings, but comparing dramatically unlike small sets has very different performance than comparing substantially similar large sets.

Try this:

class String
  # NB: this walks both strings in order, so it assumes matching character
  # order; the OP later clarified that order should not matter (see Edit 1).
  def subset_of?(str)
    e2 = str.each_char
    c2 = c2p = nil
    each_char do |c1|
      c2p, c2 = c2, e2.next
      next if c2 == c1
      c2p, c2 = c2, e2.next until (c2 != c2p) # advance past a run of duplicates in str
      return false if c2 != c1
    end
    true
  rescue StopIteration
    false
  end
end

Test the function:

>> "chedddar".subset_of?("cheddaaaaaar")
=> false
>> "cheddar".subset_of?("cheddaaaaaar")
=> true
>> "cheddar".subset_of?("cheddaaaaaarkkkk")
=> true
>> "chedddar".subset_of?("cheddar")
=> false
>> "chedddar".subset_of?("chedd")
=> false

Edit 1

Updated the solution based on the additional information provided.

class String
  def subset_of?(str)
    h1, h2 = [self, str].map {|s| s.each_char.reduce(Hash.new(0)){|h, c| h[c] += 1; h}}
    h1.all?{|c, k| h2[c] >= k}
  end
end
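
With this version the ordering no longer matters, which matches the requirement added in the comments (my addition):

"cheddar".subset_of?("cheaddr")   #=> true
"chedddar".subset_of?("cheddaar") #=> false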

10 Comments

Very nice approach, but I also want p "cheddar".subset_of?("cheaddr") to return true, i.e. I don't want to take the ordering of the string / array into account. My main concern is that the response time be as fast as possible, ideally faster than my approach.
Updated my answer. Take a look.
Updated again with a slightly faster version.
@KandadaBoggu your revised (Edit1) answer is similar to my approach, but you will do N passes over str for O(N*M) cost where you could get O(N+M) cost by accumulating the counts for str in one #reduce.
@KandadaBoggu Why would Enumerable#all? need Rails to avoid error?

How about removing the dupes first?

(A.uniq - B.uniq).empty?
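
Note that uniq throws away exactly the duplicate counts the question cares about, so this can give false positives (my addition as a quick check):

a = "chedddar".split(//)  # three "d"s; not a subset of b by the question's rule
b = "cheddaar".split(//)
(a.uniq - b.uniq).empty?  #=> true, even though a isn't a subset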

Comments
