Python list filtering: remove subsets from list of lists

Question

Using Python, how do you reduce a list of lists by an ordered subset match [[..],[..],..]?

In the context of this question a list L is a subset of list M if M contains all members of L, and in the same order. For example, the list [1,2] is a subset of the list [1,2,3], but not of the list [2,1,3].
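As a minimal sketch of this definition (the helper name `is_ordered_subset` is hypothetical and not part of the question), the check can be written with a single iterator over M, so that each member of L has to be found after the previous one:

```python
def is_ordered_subset(short, long):
    """Return True if every item of `short` appears in `long`, in the same order."""
    it = iter(long)
    # `item in it` consumes the iterator up to and including the first match,
    # so later items can only match further along in `long`.
    return all(item in it for item in short)

print(is_ordered_subset([1, 2], [1, 2, 3]))  # True
print(is_ordered_subset([1, 2], [2, 1, 3]))  # False
```

Note that this version does not require the matching members to be adjacent in M.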

Example input:

a. [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
b. [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]

Expected result:

a. [[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]
b. [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69],  [2, 3, 21], [1, 2, 4, 8], [1, 2, 4, 5, 6]]

Further Examples:

L = [[1, 2, 3, 4, 5, 6, 7], [1, 2, 5, 6]] - No reduce

L = [[1, 2, 3, 4, 5, 6, 7], ~~[1, 2, 3]~~, [1, 2, 4, 8]] - Yes reduce

L = [[1, 2, 3, 4, 5, 6, 7], [7, 6, 5, 4, 3, 2, 1]] - No reduce

(Sorry for causing confusion with the incorrect data set.)

What is a superset list? It's any set that does not appear as a subset of another?
– hughdbrown, Commented Aug 23, 2009 at 16:28
No, [1,2,4,5,6] is a "subset" of [1, 2, 3, 4, 5, 6, 7], according to the problem definition.
– João Silva, Commented Aug 23, 2009 at 17:26
I think you need to produce a definitive set of test cases - I'll be happy to write code against them. It seems that neither of my answers are entirely correct.
– quamrana, Commented Aug 24, 2009 at 21:29
I don't understand. [1,2,4,5,6] is omitted in one test data set because of [1,2,3,4,5,6,7] but not in this test data? [[1, 2, 3, 4, 5, 6, 7], [1, 2, 4, 5, 6]] Am I reading the "No reduce" comment wrong?
– hughdbrown, Commented Aug 24, 2009 at 21:49

iElectric · Accepted Answer · 2009-08-23 17:49:14Z

11

This could be simplified, but:

l = [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
l2 = l[:]

for m in l:
    for n in l:
        if set(m).issubset(set(n)) and m != n:
            l2.remove(m)
            break

print(l2)
[[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]
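One caveat worth illustrating: the set-based test in the loop above ignores element order. The sketch below wraps the same loop in a function (the name `remove_set_subsets` is hypothetical) and runs it on the question's reversed-list example, which is marked "No reduce":

```python
def remove_set_subsets(lists):
    """Same logic as the loop above: drop m when set(m) is a subset of another list's set."""
    result = lists[:]
    for m in lists:
        for n in lists:
            if set(m).issubset(set(n)) and m != n:
                result.remove(m)
                break
    return result

# Both lists contain the same members, so each one counts as a "subset" of the other.
print(remove_set_subsets([[1, 2, 3, 4, 5, 6, 7], [7, 6, 5, 4, 3, 2, 1]]))  # []
```

Both lists are dropped here, so an order-aware comparison is needed when order matters.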

answered Aug 23, 2009 at 17:49

iElectric

3 Comments

hughdbrown Over a year ago

Sets, list comprehension, enumerate(): l2 = [m for i, m in enumerate(l) if not any(set(m).issubset(set(n)) for n in (l[:i] + l[i+1:]))]

Oliver Over a year ago

Input: [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]] Out: [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 4, 8]] The sequence [1, 2, 4, 5, 6] has been lost?

hughdbrown Over a year ago

[1, 2, 4, 5, 6] should be dropped because it is an ordered subset of [1, 2, 3, 4, 5, 6, 7], right?

Kenan Banks · Accepted Answer · 2009-08-23 18:13:31Z

7

This code should be rather memory efficient. Beyond storing your initial list of lists, this code uses negligible extra memory (no temporary sets or copies of lists are created).

def is_subset(needle,haystack):
   """ Check if needle is ordered subset of haystack in O(n)  """

   if len(haystack) < len(needle): return False

   index = 0
   for element in needle:
      try:
         index = haystack.index(element, index) + 1
      except ValueError:
         return False
   else:
      return True

def filter_subsets(lists):
   """ Given list of lists, return new list of lists without subsets  """

   for needle in lists:
      if not any(is_subset(needle, haystack) for haystack in lists
         if needle is not haystack):
         yield needle

my_lists = [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3],
            [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
print(list(filter_subsets(my_lists)))
# Output: [[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]

And, just for fun, a one-liner:

def filter_list(L):
    return [x for x in L if not any(set(x)<=set(y) for y in L if x is not y)]
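The `+ 1` in `haystack.index(element, index) + 1` matters when the needle repeats an element. The following sketch compares the function with a hypothetical variant that omits the `+ 1` (named `is_subset_no_advance` here purely for illustration) and leaves out the length check so the difference is visible:

```python
def is_subset_no_advance(needle, haystack):
    """Hypothetical buggy variant: the search restarts ON the matched element."""
    index = 0
    for element in needle:
        try:
            index = haystack.index(element, index)
        except ValueError:
            return False
    return True

def is_subset(needle, haystack):
    """Ordered-subset check that steps past each matched element."""
    index = 0
    for element in needle:
        try:
            index = haystack.index(element, index) + 1
        except ValueError:
            return False
    return True

print(is_subset_no_advance([1, 1, 1], [2, 1, 3]))  # True: the single 1 is matched three times
print(is_subset([1, 1, 1], [2, 1, 3]))             # False
```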

edited Aug 23, 2009 at 18:13

answered Aug 23, 2009 at 17:35

Kenan Banks

6 Comments

hughdbrown Over a year ago

This line is a good idea: "index = haystack.index(element, index)". Instead, I shortened the list each time.

hughdbrown Over a year ago

I'm guessing, though, that this code would say that [1,1,1,1,1,1] is a subset of [1]. You need "index = 1 + haystack.index(element, index)".

Kenan Banks Over a year ago

@hugh, your example would be handled by checking the lengths first, but you're right. [1,1,1] is a subset of [2,1,3] in this code. Changing it now.

Oliver Over a year ago

Same problem as @iElectric's solution: a sequence is lost. In: [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]] Out: [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 4, 8]]

Kenan Banks Over a year ago

1,2,4,5,6 is an ordered subset of 1,2,3,4,5,6,7. By your specs, it should be removed.

hughdbrown · Accepted Answer · 2009-08-24 21:45:57Z

1

A list is a superlist if it is not a subset of any other list. It's a subset of another list if every element of the list can be found, in order, in another list.

Here's my code:

def is_sublist_of_any_list(cand, lists):
    # Compare candidate to a single list
    def is_sublist_of_list(cand, target):
        try:
            i = 0
            for c in cand:
                i = 1 + target.index(c, i)
            return True
        except ValueError:
            return False
    # See if candidate matches any other list
    return any(is_sublist_of_list(cand, target) for target in lists if len(cand) <= len(target))

# Compare candidates to all other lists
def super_lists(lists):
    return [cand for i, cand in enumerate(lists) if not is_sublist_of_any_list(cand, lists[:i] + lists[i+1:])]

if __name__ == '__main__':
    lists = [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
    superlists = super_lists(lists)
    print(superlists)

Here are the results:

[[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]

Edit: Results for your later data set.

>>> lists = [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]
>>> superlists = super_lists(lists)
>>> expected = [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 4, 8]]
>>> assert(superlists == expected)
>>> print(superlists)
[[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 4, 8]]

edited Aug 24, 2009 at 21:45

answered Aug 23, 2009 at 17:42

hughdbrown

4 Comments

Oliver Over a year ago

Same problem, sequences are lost

hughdbrown Over a year ago

Same problem as what? And what do you mean by "sequences are lost"? Does that mean it does not produce the desired results? If not, show an example. The code above generates the results shown.

hughdbrown Over a year ago

Okay, I tried it on your new data set and it produces exactly the results you expected/wanted.

Oliver Over a year ago

Asking this late at night wasn't a good idea, and I feel bad. I left [1,2,4,5,6] out of the expected result.

quamrana · Accepted Answer · 2009-08-23 17:18:24Z

0

This seems to work:

original=[[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]

target=[[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]

class SetAndList:
    def __init__(self,aList):
        self.list=aList
        self.set=set(aList)
        self.isUnique=True
    def compare(self,aList):
        s=set(aList)
        if self.set.issubset(s):
            #print self.list,'superceded by',aList
            self.isUnique=False

def listReduce(lists):
    temp=[]
    for l in lists:
        for t in temp:
            t.compare(l)
        temp.append( SetAndList(l) )

    return [t.list for t in temp if t.isUnique]

print(listReduce(original))
print(target)

This prints the calculated list and the target for visual comparison.

Uncomment the print line in the compare method to see how various lists get superseded.

Tested with python 2.6.2

answered Aug 23, 2009 at 17:18

quamrana

2 Comments

Oliver Over a year ago

Fails to reduce fully. If given [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]] Output: [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 4], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]] Fails to reduce [1,2,3] into one of the larger groups

quamrana Over a year ago

@OP: See my next answer today, Aug 24

João Silva · Accepted Answer · 2009-08-23 17:19:33Z

0

I implemented a different issubseq because yours doesn't say that [1, 2, 4, 5, 6] is a subsequence of [1, 2, 3, 4, 5, 6, 7], for example (besides being painfully slow). The solution I came up with looks like this:

def is_subseq(a, b):
    if len(a) > len(b): return False
    start = 0
    for el in a:
        while start < len(b):
            if el == b[start]:
                break
            start = start + 1
        else:
            return False
        # Step past the matched element so it cannot be matched twice
        start = start + 1
    return True

def filter_partial_matches(sets):
    return [s for s in sets if all([not(is_subseq(s, ss)) for ss in sets if s != ss])]

A simple test case, given your inputs and outputs:

>>> test = [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
>>> another_test = [[1, 2, 3, 4], [2, 4, 3], [3, 4, 5]]
>>> filter_partial_matches(test)
[[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]
>>> filter_partial_matches(another_test)
[[1, 2, 3, 4], [2, 4, 3], [3, 4, 5]]

Hope it helps!

answered Aug 23, 2009 at 17:19

João Silva

1 Comment

Oliver Over a year ago

same problem as commented on other solutions, sequences are lost

dugres · Accepted Answer · 2009-08-24 08:56:02Z

0

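list0=[[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]

for list1 in list0[:]:          # iterate over a copy so list0 can be modified
    removed=False
    for list2 in list0:
        if list2!=list1:
            len1=len(list1)
            c=0
            for n in list2:
                if n==list1[c]:
                    c+=1
                if c==len1:
                    list0.remove(list1)
                    removed=True
                    break
        if removed:
            break               # list1 is gone; do not try to remove it again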

This filters list0 in place while iterating over a copy of it. This is good if the result is expected to be about the same size as the original, i.e. there are only a few "subsets" to remove.

If the result is expected to be small and the original is large, you might prefer this one, which is more memory friendly as it doesn't copy the original list.

list0=[[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
result=[]

for list1 in list0:
    subset=False
    for list2 in list0:
        if list2!=list1:
            len1=len(list1)
            c=0
            for n in list2:
                if n==list1[c]:
                    c+=1
                if c==len1:
                    subset=True
                    break
            if subset:
                break
    if not subset:
        result.append(list1)

edited Aug 24, 2009 at 8:56

answered Aug 23, 2009 at 17:18

dugres

3 Comments

hughdbrown Over a year ago

You don't need the comparison of list1 and list2 if you keep track of where you are. Use enumerate() to store the index and make sublists that omit that list: "for i, list1 in enumerate(list0):\n for list2 in (list0[:i] + list0[i+1:]):\n\n"

dugres Over a year ago

Right, I'm not sure it's worth it though.

Oliver Over a year ago

same problem as stated in other solutions.

Ants Aasma · Accepted Answer · 2009-08-24 09:57:17Z

0

Edit: I really need to improve my reading comprehension. Here's the answer to what was actually asked. It exploits the fact that "A is super of B" implies "len(A) > len(B) or A == B".

def advance_to(it, value):
    """Advances an iterator until it matches the given value. Returns False
    if not found."""
    for item in it:
        if item == value:
            return True
    return False

def has_supersequence(seq, super_sequences):
    """Checks if the given sequence has a supersequence in the list of
    supersequences.""" 
    candidates = map(iter, super_sequences)
    for next_item in seq:
        candidates = [it for it in candidates if advance_to(it, next_item)]
    return len(candidates) > 0

def find_supersequences(sequences):
    """Finds the supersequences in the given list of sequences.

    Sequence A is a supersequence of sequence B if B can be created by removing
    items from A."""
    super_seqs = []
    for candidate in sorted(sequences, key=len, reverse=True):
        if not has_supersequence(candidate, super_seqs):
            super_seqs.append(candidate)
    return super_seqs

print(find_supersequences([[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3],
    [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]))
#Output: [[1, 2, 3, 4, 5, 6, 7], [1, 2, 4, 8], [2, 3, 21]]

If you need to also preserve the original order of the sequences, then the find_supersequences() function needs to keep track of the positions of the sequences and sort the output afterwards.
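A minimal sketch of that idea follows. It is a simplified stand-in rather than the code above: the names `is_subsequence` and `find_supersequences_ordered` are hypothetical, and each candidate is checked against the kept sequences one at a time.

```python
def is_subsequence(short, long):
    """True if `short` can be obtained by removing items from `long`."""
    it = iter(long)
    return all(item in it for item in short)

def find_supersequences_ordered(sequences):
    """Like find_supersequences(), but the result keeps the input order."""
    # Process longest first, remembering each sequence's original position.
    indexed = sorted(enumerate(sequences), key=lambda pair: len(pair[1]), reverse=True)
    kept = []
    for pos, cand in indexed:
        if not any(is_subsequence(cand, other) for _, other in kept):
            kept.append((pos, cand))
    # Restore the original order using the remembered positions.
    return [seq for _, seq in sorted(kept, key=lambda pair: pair[0])]

print(find_supersequences_ordered([[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3],
    [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]))
# Output: [[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]
```

If the same sequence occurs more than once, this sketch keeps only one copy of it.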

edited Aug 24, 2009 at 9:57

answered Aug 23, 2009 at 16:34

Ants Aasma

8 Comments

Oliver Over a year ago

This doesn't respect the list order e.g. if given [[1,2,3,4],[2,4,3],[3,4,5]] the result is [[1,2,3,4],[2,4,3]] when I would want it to return the initial input.

Ants Aasma Over a year ago

@Triptych: he didn't state that in the original question.

Oliver Over a year ago

I did state the order was important: "order must be respected". But that's not a big deal. Thanks for offering up possible solutions.

hughdbrown Over a year ago

@Oli_UK: if order is not important, then using sets is the clear winner. Iterative solutions would be a mistake. Can you clarify this point?

hughdbrown Over a year ago

Your second solution includes [1, 2, 4, 5, 6] when it should not. is_superseq() seems to assume that the elements have to be consecutive for a list to be disqualified.

quamrana · Accepted Answer · 2009-08-24 21:04:03Z

0

Refined answer after new test case:

original= [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]

class SetAndList:
    def __init__(self,aList):
        self.list=aList
        self.set=set(aList)
        self.isUnique=True
    def compare(self,other):
        if self.set.issubset(other.set):
            #print self.list,'superceded by',other.list
            self.isUnique=False

def listReduce(lists):
    temp=[]
    for l in lists:
        s=SetAndList(l)
        for t in temp:
            t.compare(s)
            s.compare(t)
        temp.append( s )
        temp=[t for t in temp if t.isUnique]

    return [t.list for t in temp if t.isUnique]

print(listReduce(original))

You didn't give the required output, but I'm guessing this is right, as [1,2,3] does not appear in the output.

answered Aug 24, 2009 at 21:04

quamrana

1 Comment

quamrana Over a year ago

Having re-read the question again (it might have changed since I last read it) my solution is still incorrect. I have missed the: [1,2] is a subset of the list [1,2,3], but not of the list [2,1,3] requirement.

Oliver · Accepted Answer · 2009-08-25 10:37:03Z

Thanks to all who suggested solutions and coped with my sometimes erroneous data sets. Starting from @hughdbrown's solution, I modified it to do what I wanted:

The modification was to use a sliding window over the target to ensure the subset sequence was found. I think I should have used a more appropriate word than 'Set' to describe my problem.

def is_sublist_of_any_list(cand, lists):
    # Compare candidate to a single list
    def is_sublist_of_list(cand, target):
        try:
            start = target.index(cand[0])
            while True:
                # Slide a window of len(cand) elements over target
                if cand == target[start:start + len(cand)]:
                    return True
                # Jump to the next occurrence of cand[0]; raises ValueError when none is left
                start = target.index(cand[0], start + 1)
        except (ValueError, IndexError):
            # cand[0] does not occur (again) in target, or cand is empty
            return False

    # See if candidate matches any other list
    return any(is_sublist_of_list(cand, target) for target in lists if len(cand) <= len(target))

# Compare candidates to all other lists
def super_lists(lists):
    a = [cand for i, cand in enumerate(lists) if not is_sublist_of_any_list(cand, lists[:i] + lists[i+1:])]
    return a

lists = [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]
expect = [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69],  [2, 3, 21], [1, 2, 4, 8], [1, 2, 4, 5, 6]]

def test():
    out = super_lists(list(lists))

    print("In  : ", lists)
    print("Out : ", out)

    assert (out == expect)

test()

Result:

In  :  [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]
Out :  [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 4, 8], [1, 2, 4, 5, 6]]
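For comparison, here is a more compact sketch of the same sliding-window idea. The function names `is_contiguous_sublist` and `super_lists_contiguous` are hypothetical, and the window is tried at every start position rather than only at occurrences of the first element:

```python
def is_contiguous_sublist(cand, target):
    """True if cand appears in target as one unbroken run of elements."""
    n = len(cand)
    # Try every window of length n; the range is empty when target is shorter than cand.
    return any(cand == target[i:i + n] for i in range(len(target) - n + 1))

def super_lists_contiguous(lists):
    """Keep only the lists that are not a contiguous sublist of another entry."""
    return [cand for i, cand in enumerate(lists)
            if not any(is_contiguous_sublist(cand, other)
                       for j, other in enumerate(lists) if j != i)]

lists = [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2],
         [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3],
         [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]
expect = [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69],
          [2, 3, 21], [1, 2, 4, 8], [1, 2, 4, 5, 6]]
assert super_lists_contiguous(lists) == expect
```

As with the index-based exclusion used above, two identical lists would each count as a sublist of the other, so both would be dropped.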

One last try: I've got simpler code in my most recent submission.

2 revs · Accepted Answer · 2009-08-25 22:25:14Z

So what you really wanted was to know if a list was a substring, so to speak, of another, with all the matching elements consecutive. Here is code that converts the candidate and the target list to comma-separated strings and does a substring comparison to see if the candidate appears within the target list.

def is_sublist_of_any_list(cand, lists):
    def comma_list(l):
        return "," + ",".join(str(x) for x in l) + ","
    # Keep the string form separate so len() below still compares list lengths
    cand_str = comma_list(cand)
    return any(cand_str in comma_list(target) for target in lists if len(cand) <= len(target))


def super_lists(lists):
    return [cand for i, cand in enumerate(lists) if not is_sublist_of_any_list(cand, lists[:i] + lists[i+1:])]

The function comma_list() puts leading and trailing commas on the list to ensure that integers are fully delimited. Otherwise, [1] would be a subset of [100], for example.
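A small sketch of why the outer commas matter, using a hypothetical `bare_list()` helper (without the leading and trailing commas) purely for comparison:

```python
def comma_list(items):
    """Join items with commas and add a leading and trailing comma."""
    return "," + ",".join(str(x) for x in items) + ","

def bare_list(items):
    """Hypothetical variant without the outer commas, for comparison only."""
    return ",".join(str(x) for x in items)

print(bare_list([1]) in bare_list([100]))              # True: "1" occurs inside "100"
print(comma_list([1]) in comma_list([100]))            # False: ",1," does not occur in ",100,"
print(comma_list([2, 3]) in comma_list([1, 2, 3, 4]))  # True: ",2,3," occurs in ",1,2,3,4,"
```

This approach assumes the string form of an element never contains a comma itself; integers, as in the question, satisfy that.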
