Python: What is an efficient way to loop over a list of strings and group substrings in the list?

Question

Background

mylist = ['abc123', 'abc123456', 'abc12355', 'def456', 'ghi789', 'def4567', 'ghi78910', 'abc123cvz']

I would like to find and group the substrings in the list into a list of tuples, where the first element of each tuple is the substring and the second element is the larger string that contains it. The expected output is given below:

[('abc123', 'abc123456'), ('abc123', 'abc12355'), ('abc123', 'abc123cvz'), ('def456', 'def4567'), ('ghi789', 'ghi78910')]

I've written the following code, which achieves the desired outcome:

substring_superstring_list = []
for sub in mylist:
   substring_superstring_pair = [(sub, s) for s in mylist if sub in s and s != sub]
   if substring_superstring_pair:
       substring_superstring_list.append(substring_superstring_pair)

flat_list = [item for sublist in substring_superstring_list for item in sublist]

Is there a more efficient way to do this? I'll eventually need to loop over a list containing 80k strings and do the above. I appreciate any suggestions/help

If you sort "mylist" first in ascending order (which is fast because of the C implementation), you can be sure that all superstrings that start with sub come after sub in the list, and before any entry that is either shorter than sub or whose first "len(sub)" characters aren't equal to sub.
– Michael Butscher, Commented Sep 26, 2022 at 21:06
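The idea in this comment can be sketched roughly as follows. This is only an illustration under one important assumption: it finds superstrings that begin with the substring (prefix matches), which is all an alphabetical sort can guarantee to place next to each other. The function name `prefix_pairs` is made up for this example.

```python
def prefix_pairs(strings):
    """Return (sub, superstring) pairs where sub is a proper prefix of superstring."""
    ordered = sorted(strings)  # plain alphabetical sort
    pairs = []
    for i, sub in enumerate(ordered):
        j = i + 1
        # In alphabetical order, every string that starts with sub sits directly
        # after sub, so the scan stops at the first entry that does not.
        while j < len(ordered) and ordered[j].startswith(sub):
            if ordered[j] != sub:  # skip exact duplicates
                pairs.append((sub, ordered[j]))
            j += 1
    return pairs


mylist = ['abc123', 'abc123456', 'abc12355', 'def456', 'ghi789',
          'def4567', 'ghi78910', 'abc123cvz']
print(prefix_pairs(mylist))
```

For the sample list every substring happens to be a prefix, so this reproduces the expected output. A pair such as 'b' and 'ab' would be missed, because 'b' is not a prefix of 'ab'.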

Nin17 · Accepted Answer · 2022-09-28 04:02:00Z

2

Combining suggestions in the comments and @ZabielskiGabriel's answer, you can do it by first sorting the list by length and then comparing each element in the sorted list with those that follow it in a list comprehension:

my_list = sorted(my_list, key=len)
[(x, y) for i, x in enumerate(my_list, 1) for y in my_list[i:] if x in y]

Benchmarks (with supplied test list):

%timeit op(my_list)
%timeit zabiel(my_list)
%timeit nin17(my_list)

Output:

3.92 µs ± 31 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.76 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.25 µs ± 7.75 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
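As a quick sanity check, the sort-then-compare approach can be wrapped in a function and compared against the expected output from the question. The function name `sorted_pairs` is hypothetical, and the comparison uses sets because sorting by length changes the order of the pairs.

```python
def sorted_pairs(strings):
    # Sort by length so that a possible superstring always comes after its substring.
    by_len = sorted(strings, key=len)
    # Compare each string only with the strings that follow it.
    return [(x, y) for i, x in enumerate(by_len, 1) for y in by_len[i:] if x in y]


my_list = ['abc123', 'abc123456', 'abc12355', 'def456', 'ghi789',
           'def4567', 'ghi78910', 'abc123cvz']
expected = [('abc123', 'abc123456'), ('abc123', 'abc12355'), ('abc123', 'abc123cvz'),
            ('def456', 'def4567'), ('ghi789', 'ghi78910')]

result = sorted_pairs(my_list)
print(result)
print(set(result) == set(expected))
```

One difference from the code in the question: there is no `x != y` check here, so if the list contains exact duplicates they will be paired with each other.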

edited Sep 28, 2022 at 4:02

answered Sep 26, 2022 at 21:51


1 Comment

AndrzejO Over a year ago

This gives wrong results for my_list = ['b', 'ab']: it outputs an empty list. The list must be sorted by length, not alphabetically: my_list = sorted(my_list, key=len)
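The difference between the two sort orders can be seen with a small sketch. The helper `pairs_after_sort` and its `key` parameter are made up for this illustration; the comprehension is the one from the answer.

```python
def pairs_after_sort(strings, key=None):
    ordered = sorted(strings, key=key)
    # Each string is only compared with the strings that come after it.
    return [(x, y) for i, x in enumerate(ordered, 1) for y in ordered[i:] if x in y]


# Alphabetical sort gives ['ab', 'b'], so 'b' is never compared against 'ab'.
print(pairs_after_sort(['b', 'ab']))           # []
# Sorting by length gives ['b', 'ab'], so the pair is found.
print(pairs_after_sort(['b', 'ab'], key=len))  # [('b', 'ab')]
```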

ZabielskiGabriel · Accepted Answer · 2022-09-26 21:26:28Z

0

Tomorrow I will try another method with map, reduce and filter. Also here you can find a nice tutorial about it:

my_list = ['abc123', 'abc123456', 'abc12355', 'def456', 'ghi789', 'def4567', 'ghi78910', 'abc123cvz']

output = []
for x in my_list:
    for y in my_list:
        if x in y and x != y:
            output.append((x, y))
print(output)

answered Sep 26, 2022 at 21:26


2 Comments

ZabielskiGabriel Over a year ago

Btw, 80k items shouldn't be a problem for Python

AndrzejO Over a year ago

80k is on the edge of what Python can do in a reasonable time. It's 6.4 billion substring checks (80,000 × 80,000)

AndrzejO · Accepted Answer · 2022-09-27 05:36:11Z

A much more efficient way is to use multiprocessing. The speedup depends on how many cores you have; on my 8-core PC it's 10-15 times faster. It's quite easy to do: just change the first for loop into a map and use multiprocessing.Pool:

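    import multiprocessing

    # Defined at module level so the worker processes can pickle and call it.
    def find_sub2(sub):
        # All strings in mylist that contain sub, excluding sub itself.
        return [(sub, s) for s in mylist if sub in s and s != sub]

    if __name__ == '__main__':
        # mylist must also be defined at module level so the workers can see it.
        with multiprocessing.Pool(processes=16) as pool:
            substring_superstring_list = pool.map(find_sub2, mylist)
        flat_list = [item for sublist in substring_superstring_list for item in sublist]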

I have compared the times of several methods (based on yours and on those from other answers) with a list of 20000 random strings of a random size 10-200:

['original', '31.936 seconds']
['traditional_loops', '64.088 seconds']
['two_for_loops', '32.337 seconds']
['sorting', '17.713 seconds']
['with_map', '31.832 seconds']
['map_with_multiprocessing', '3.08 seconds']
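To get a rough feel for what these numbers mean for the 80k strings mentioned in the question, here is a back-of-the-envelope extrapolation. It assumes the run time grows with the square of the list length (every string is checked against every other string) and that the 80k strings look similar to the random test strings. Both are assumptions, so treat the results as order-of-magnitude estimates only.

```python
measured_n = 20_000   # list size used in the benchmark above
target_n = 80_000     # list size mentioned in the question

# Timings in seconds, copied from the benchmark results above.
measured_seconds = {
    'original': 31.936,
    'sorting': 17.713,
    'map_with_multiprocessing': 3.08,
}

# 4 times as many strings means 4**2 = 16 times as many comparisons.
scale = (target_n / measured_n) ** 2

estimates = {name: seconds * scale for name, seconds in measured_seconds.items()}
for name, seconds in estimates.items():
    print(f'{name}: about {seconds:.0f} seconds ({seconds / 60:.1f} minutes)')
```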

Here is the code:

from tqdm import tqdm
import multiprocessing
import random
import time

ALLOWED_CHARS = 'abcdeghijklmn'
NUMBER_OF_STRINGS = 20000
MIN_STR_LENGTH = 10
MAX_STR_LENGTH = 100

def random_string_generator(str_size, allowed_chars=ALLOWED_CHARS):
    return ''.join(random.choice(allowed_chars) for _ in range(str_size))


print('Creating random strings')
mylist = [random_string_generator(random.randint(MIN_STR_LENGTH, MAX_STR_LENGTH)) for _ in tqdm(range(NUMBER_OF_STRINGS))]


def original():
    substring_superstring_list = []
    for sub in tqdm(mylist):
        sub_pair = [(sub, s) for s in mylist if sub in s and s != sub]
        if sub_pair:
            substring_superstring_list.append(sub_pair)
    return [item for sublist in substring_superstring_list for item in sublist]


def traditional_loops():
    output = []
    for i in tqdm(range(len(mylist))):
        for j in range(len(mylist)):
            if i != j and mylist[i] in mylist[j]:
                output.append((mylist[i], mylist[j]))
    return output


def two_for_loops():
    flat_list = []
    for x in tqdm(mylist):
        for y in mylist:
            if x in y and x != y:
                flat_list.append((x, y))
    return flat_list


def with_map():
    def find_sub(sub):
        sub_pair = [(sub, s) for s in mylist if sub in s and s != sub]
        if sub_pair:
            return sub_pair
        else:
            return []
    substring_superstring_list = map(find_sub, tqdm(mylist))
    return [item for sublist in substring_superstring_list for item in sublist]


def map_with_multiprocessing():
    global find_sub2
    def find_sub2(sub):
        sub_pair = [(sub, s) for s in mylist if sub in s and s != sub]
        if sub_pair:
            return sub_pair
        else:
            return []
    pool = multiprocessing.Pool(processes=16)
    substring_superstring_list = pool.map(find_sub2, tqdm(mylist))
    pool.close()
    return [item for sublist in substring_superstring_list for item in sublist]


def sorting():
    # Sort by length, not alphabetically: a substring is never longer than its superstring.
    my_list = sorted(mylist, key=len)
    return [(x, y) for i, x in enumerate(tqdm(my_list), 1) for y in my_list[i:] if x in y]


methods = [original, traditional_loops, two_for_loops, sorting, with_map, map_with_multiprocessing]
results = []
for fun in methods:
    print()
    print(f'Start testing {fun.__name__}')
    start = time.time()
    flat_list = fun()
    #print(flat_list)
    end = time.time()
    result = [fun.__name__, f'{int(1000 * (end - start)) / 1000.} seconds', flat_list]
    results.append(result)

solution = (set(results[0][2]), len(results[0][2]))
print()
for i in results:
    print(f'{i[:2]} Solution is correct? {solution == (set(i[2]), len(i[2]))}')
