Find the longest most common items in multiple lists (not substring)

Question

Let's say we have a list of N lists. For example:

L = [['A','B','C','D','E'], ['A','B','C'],['B','C','D'],['C','D'],['A','C','D']]

I want to find the longest common subsets that occur in this list and the corresponding counts. In this case:

ans = {'A,B,C':2, 'A,C,D':2, 'B,C,D':2}

I think this question is similar to mine, but I am having a hard time understanding the C# code.

If you have a very large number of lists with many elements, you might want to look at this related question, which talks about parallelization and other optimizations for this problem.
– kcsquared, Commented Feb 17, 2022 at 17:54

Ben Grossmann · Accepted Answer · 2022-02-17 18:00:56Z


I assume that a "common subset" is a set that is a subset of at least two lists in the array.

With that in mind, here's one solution.

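from itertools import combinations
from collections import Counter
L = [['A','B','C','D','E'], ['A','B','C'],['B','C','D'],['C','D'],['A','C','D']]

# frozensets are hashable, so they can be used as Counter keys
L = [*map(frozenset,L)]
# intersections of every pair of lists, de-duplicated (first-seen order kept)
# so that a set shared by three or more lists is not counted once per pair
sets = [*dict.fromkeys(l1&l2 for l1,l2 in combinations(L,2))]
# keep only the largest intersections
maxlen = max(len(s) for s in sets)
sets = [s for s in sets if len(s) == maxlen]
# count the lists that contain each of those intersections
count = Counter(s for s in sets for l in L if s <= l)
# sort the items so each key has a stable order ('A,B,C', not 'C,A,B')
dic = {','.join(sorted(s)):k for s,k in count.items()}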

Resulting dictionary dic:

{'A,B,C': 2, 'B,C,D': 2, 'A,C,D': 2}
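For reuse, the same steps can be wrapped in a function. This is only a sketch: the name `longest_common_subsets` is made up for illustration, and it assumes the items are strings. It returns an empty dictionary when no two lists share an item, and it counts each subset once per containing list even when three or more lists share it.

```python
from collections import Counter
from itertools import combinations


def longest_common_subsets(lists):
    """Map each largest pairwise intersection to the number of lists containing it."""
    sets = [frozenset(l) for l in lists]
    # Unique intersections of every pair of lists.
    common = {a & b for a, b in combinations(sets, 2)}
    maxlen = max((len(s) for s in common), default=0)
    if maxlen == 0:
        return {}  # fewer than two lists, or no two lists share an item
    # Keep only the largest intersections, then count the lists containing each.
    largest = [s for s in common if len(s) == maxlen]
    counts = Counter(s for s in largest for l in sets if s <= l)
    # Sort the items so each key is deterministic.
    return {','.join(sorted(s)): n for s, n in counts.items()}


L = [['A', 'B', 'C', 'D', 'E'], ['A', 'B', 'C'], ['B', 'C', 'D'], ['C', 'D'], ['A', 'C', 'D']]
result = longest_common_subsets(L)
print(result)

# Three lists share {'X', 'Y'}: the subset is counted once per list.
shared = longest_common_subsets([['X', 'Y', 'P'], ['X', 'Y', 'Q'], ['X', 'Y', 'R']])
print(shared)
```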

edited Feb 17, 2022 at 18:00

answered Feb 17, 2022 at 15:49

Ben Grossmann


9 Comments

Alex Over a year ago

Thank you! Would this scale if I have ~200K lists, each of length 100? Making all the combinations explicitly might pose an issue, right? And do you mind commenting on why you used frozenset?

Ben Grossmann Over a year ago

Making all the pairwise intersections is O(n^2) in the number of lists; I'm not sure what that means concretely for ~200K lists. I would say just try it and see if it's taking too long. One optimization in that initial step is to keep track of the max-length subset so that we can skip combinations involving elements like ['C','D'] that are too short to consider.

Ben Grossmann Over a year ago

Regarding frozensets: sets aren't hashable, so you can't use them as keys for dictionary objects like a Counter.
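A quick illustration of that point, as a minimal sketch (the variable names are only for the example): a frozenset works as a Counter key, while a plain set raises TypeError.

```python
from collections import Counter

counts = Counter()

# A frozenset is immutable and hashable, so it works as a Counter key.
counts[frozenset(['A', 'B'])] += 1
counts[frozenset(['B', 'A'])] += 1  # same items, so the same key

# A plain set is mutable and unhashable; using it as a key raises TypeError.
try:
    counts[{'A', 'B'}] += 1
    set_key_failed = False
except TypeError:
    set_key_failed = True

print(counts[frozenset(['A', 'B'])], set_key_failed)
```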

Ben Grossmann Over a year ago

PS: Regarding that "optimization", keeping track of the max-length really just amounts to filtering out any elements of L smaller than the second-to-largest element.
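One way the skipping idea could look in code, as a sketch only (the function name `largest_pairwise_intersections` is hypothetical). It sorts the lists by size and stops as soon as a list is shorter than the best intersection found so far. It tracks the running maximum rather than a fixed size threshold, because the largest intersection can come from lists shorter than the second-largest one. The worst case is still quadratic in the number of lists.

```python
def largest_pairwise_intersections(lists):
    """Return (best, found): the largest pairwise intersection size and the sets of that size."""
    # Sort by size, largest first, so short lists can be skipped early.
    sets = sorted((frozenset(l) for l in lists), key=len, reverse=True)
    best, found = 0, set()
    for i, a in enumerate(sets):
        if len(a) < best:
            break  # a and every later set are too short to reach `best`
        for j in range(i + 1, len(sets)):
            b = sets[j]
            if len(b) < best:
                break  # an intersection is never larger than its smaller operand
            common = a & b
            if len(common) > best:
                best, found = len(common), {common}
            elif common and len(common) == best:
                found.add(common)
    return best, found


L = [['A', 'B', 'C', 'D', 'E'], ['A', 'B', 'C'], ['B', 'C', 'D'], ['C', 'D'], ['A', 'C', 'D']]
best, found = largest_pairwise_intersections(L)
print(best, found)

# The two largest lists are disjoint; the best intersection comes from the short ones.
best2, found2 = largest_pairwise_intersections(
    [['1', '2', '3', '4', '5'], ['6', '7', '8', '9', '0'], ['a', 'b', 'c'], ['a', 'b', 'c']]
)
print(best2, found2)
```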

kcsquared Over a year ago

You should precompute frozenset(l1) for each l1. Right now, that step takes O(m*L^2) time, where L is the number of lists and m is the max size of a list.
