What's the most efficient way of identifying repeated patterns in arrays of objects using Python?

Question

I have two arrays of objects:

a = ['a', 'b', 'c', 'd', 'e', 'f', 'e', 'f']

b = ['a', 'b', 'd', 'f', 'e', 'f']

I would like to identify the repeated patterns of more than one object and their occurrences like

['a', 'b']: 2

['e', 'f']: 3

['f', 'e', 'f']: 2

The first sequence ['a', 'b'] appeared once in a and once in b, so total count 2. The 2nd sequence ['e', 'f'] appeared twice in a, once in b, so total 3. The 3rd sequence ['f', 'e', 'f'] appeared once in a, and once in b, so total 2.
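For illustration, the totals above can be reproduced with a brute-force check. This is only a sketch: `count_occurrences` is a hypothetical helper name, and it counts contiguous, possibly overlapping matches.

```python
def count_occurrences(sequence, pattern):
    """Count contiguous (possibly overlapping) occurrences of pattern in sequence."""
    n = len(pattern)
    return sum(
        1
        for i in range(len(sequence) - n + 1)
        if sequence[i:i + n] == pattern
    )

a = ['a', 'b', 'c', 'd', 'e', 'f', 'e', 'f']
b = ['a', 'b', 'd', 'f', 'e', 'f']

for pattern in (['a', 'b'], ['e', 'f'], ['f', 'e', 'f']):
    total = count_occurrences(a, pattern) + count_occurrences(b, pattern)
    print(pattern, total)
```

This prints totals of 2, 3 and 2 for the three patterns, matching the counts described above.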

Is there a good way to do this in Python?

Also, the universe of objects is limited. I was wondering if there's an efficient solution that uses a hash table.

What is the actual problem you are trying to solve? Please review the minimal reproducible example guidelines: what types of objects are involved, and what does the pattern of objects in these lists accomplish?
– TemporalWolf, Commented Feb 14, 2017 at 0:17

SSSINISTER · Accepted Answer · 2017-02-14 21:49:21Z

3

If you only need to handle two lists, the following approach should work, though I am not sure it is the most efficient solution.

A nice description of finding n-grams is given in this blog post.
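As a quick illustration of the n-gram idea (a minimal sketch, not taken from the linked post): zipping n shifted copies of a list yields every contiguous window of length n, because `zip` stops at the shortest input. The sample list `words` is made up for the example.

```python
def find_ngrams(input_list, n):
    # Zip n shifted copies of the list; zip stops at the shortest copy,
    # so only complete windows of length n are produced.
    return list(zip(*[input_list[i:] for i in range(n)]))

words = ['a', 'b', 'c', 'd']
print(find_ngrams(words, 2))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
print(find_ngrams(words, 3))  # [('a', 'b', 'c'), ('b', 'c', 'd')]
```

Each n-gram comes back as a tuple, which is hashable and can therefore be used as a dictionary key when counting.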

This approach sets a minimum length (2) and takes the maximum length that a repeating sequence might have to be the full length of the list.

We then find all the sequences of each length for each list and combine the two results into a single list. A counter dictionary then maps every sequence to the number of times it occurs.

Finally we return a dictionary of all the sequences that occur more than once.

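def find_repeating(list_a, list_b):
    min_len = 2  # ignore patterns made of a single object

    def find_ngrams(input_list, n):
        # Zip n shifted copies of the list to get every window of length n.
        return zip(*[input_list[i:] for i in range(n)])

    # Collect every n-gram of length min_len..len(list) from each list.
    seq_list_a = []
    for seq_len in range(min_len, len(list_a) + 1):
        seq_list_a.extend(find_ngrams(list_a, seq_len))

    seq_list_b = []
    for seq_len in range(min_len, len(list_b) + 1):
        seq_list_b.extend(find_ngrams(list_b, seq_len))

    all_sequences = seq_list_a + seq_list_b

    # Count occurrences; the n-grams are tuples, so they can be dict keys.
    counter = {}
    for seq in all_sequences:
        counter[seq] = counter.get(seq, 0) + 1

    # Keep only the sequences seen more than once across both lists.
    filtered_counter = {k: v for k, v in counter.items() if v > 1}

    return filtered_counter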

Do let me know if you are unsure about anything.

>>> list_a = ['a', 'b', 'c', 'd', 'e', 'f', 'e', 'f']
>>> list_b = ['a', 'b', 'd', 'f', 'e', 'f']
>>> print(find_repeating(list_a, list_b))
{('a', 'b'): 2, ('e', 'f'): 3, ('f', 'e'): 2, ('f', 'e', 'f'): 2}

edited Feb 14, 2017 at 21:49

answered Feb 14, 2017 at 0:58

SSSINISTER


5 Comments

Ian Lin Over a year ago

Thanks! I think you need to cast the max_len_a and max_len_b to integer right?

Ian Lin Over a year ago

How would you modify this if I'm looking for the longest overlapping pattern? E.g. ('f', 'e', 'f') covers ('f', 'e') and ('e', 'f'). So if I expect the answer to be like {('e', 'f'): 1, ('f', 'e', 'f'): 2, ('a', 'b'): 2}, how should I modify the code?

Matthew Cole Over a year ago

Counterexample to the above code: list_a = ['a','b','c','d','e','f','a'], list_b = ['a','b','c','d','e','f','b']. The common subsequence should be abcdef. However, the code above limits the search space to int(len(list_a)/2) and int(len(list_b)/2), causing it to produce 9 subsequences in total, including 4 longest subsequences of length 3 (abc, bcd, cde and def). It appears that this code fragment does not answer the question.

SSSINISTER Over a year ago

@MatthewCole - Thanks for pointing out the error in my search space. Changing int(len(list_a)/2) to len(list_a) solves that issue, however, and the edited code fragment does answer the question. @IanLin - I'll look into the problem and edit the fragment accordingly.

SSSINISTER Over a year ago

@MatthewCole Moreover, although your code may be shorter, it is not necessarily more efficient, and it is approximately 3 times slower than mine. Do feel free to check.

Matthew Cole · Accepted Answer · 2017-02-14 16:27:59Z

When you mentioned that you were looking for an efficient solution, my first thought was of the approaches to solving the longest common subsequence problem. But in your case, we actually do need to enumerate all common subsequences so that we can count them, so a dynamic programming solution will not do. Here's my solution. It's certainly shorter than SSSINISTER's solution (mostly because I use the collections.Counter class).

#!/usr/bin/env python3

def find_repeating(sequence_a, sequence_b, min_len=2):
    from collections import Counter

    # Find all subsequences
    subseq_a = [tuple(sequence_a[start:stop]) for start in range(len(sequence_a)-min_len+1) 
        for stop in range(start+min_len,len(sequence_a)+1)]
    subseq_b = [tuple(sequence_b[start:stop]) for start in range(len(sequence_b)-min_len+1) 
        for stop in range(start+min_len,len(sequence_b)+1)]

    # Find common subsequences
    common = set(tup for tup in subseq_a if tup in subseq_b)

    # Count common subsequences
    return Counter(tup for tup in (subseq_a + subseq_b) if tup in common)

Resulting in ...

>>> list_a = ['a', 'b', 'c', 'd', 'e', 'f', 'e', 'f'] 
>>> list_b = ['a', 'b', 'd', 'f', 'e', 'f']
>>> print(find_repeating(list_a, list_b))
Counter({('e', 'f'): 3, ('f', 'e'): 2, ('a', 'b'): 2, ('f', 'e', 'f'): 2})
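On the asker's hash table question, one possible variant is sketched below, assuming the list elements are hashable. It counts the subsequences of each list in a `Counter` (a dict subclass) and intersects the key views, so the "is this common?" check becomes a hash lookup instead of a list scan. The names `ngram_counts` and `count_common` are hypothetical, and no timing claim is made here.

```python
from collections import Counter

def ngram_counts(sequence, min_len=2):
    """Count every contiguous run of at least min_len items (items must be hashable)."""
    return Counter(
        tuple(sequence[start:stop])
        for start in range(len(sequence) - min_len + 1)
        for stop in range(start + min_len, len(sequence) + 1)
    )

def count_common(sequence_a, sequence_b, min_len=2):
    counts_a = ngram_counts(sequence_a, min_len)
    counts_b = ngram_counts(sequence_b, min_len)
    # Dictionary key views support set intersection, so this step is hash-based.
    common = counts_a.keys() & counts_b.keys()
    return {seq: counts_a[seq] + counts_b[seq] for seq in common}

list_a = ['a', 'b', 'c', 'd', 'e', 'f', 'e', 'f']
list_b = ['a', 'b', 'd', 'f', 'e', 'f']
result = count_common(list_a, list_b)
print(result)
```

For the example lists this yields the same four sequences and counts as above. It still enumerates every contiguous run, so the number of candidates grows quadratically with the list length; only the membership test gets cheaper.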

The advantage to using collections.Counter is that not only do you not need to produce the actual code to iterate and count, you get access to all of the dict methods as well as a few specialized methods for using those counts.
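A few of those conveniences, sketched on the result shown earlier (the literal below simply copies that output; the key `('x', 'y')` is made up to show a missing entry):

```python
from collections import Counter

counts = Counter({('e', 'f'): 3, ('f', 'e'): 2, ('a', 'b'): 2, ('f', 'e', 'f'): 2})

# Specialized Counter method: the highest-count entries first.
print(counts.most_common(1))   # [(('e', 'f'), 3)]

# Missing keys report a count of zero instead of raising KeyError.
print(counts[('x', 'y')])      # 0

# Ordinary dict methods still work.
print(sum(counts.values()))    # 9

# Longest common pattern, using the tuples' lengths as the sort key.
longest = max(counts, key=len)
print(longest)                 # ('f', 'e', 'f')
```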
