0

Background

mylist = ['abc123', 'abc123456', 'abc12355', 'def456', 'ghi789', 'def4567', 'ghi78910', 'abc123cvz']

I would like to find the strings in the list that are substrings of other strings in the list and group them into a list of tuples, where the first element of each tuple is the substring and the second element is the larger string that contains it. The expected output is given below:

[('abc123', 'abc123456'), ('abc123', 'abc12355'), ('abc123', 'abc123cvz'), ('def456', 'def4567'), ('ghi789', 'ghi78910')]

I've written the following code, which achieves the desired outcome:

substring_superstring_list = []
for sub in mylist:
   substring_superstring_pair = [(sub, s) for s in mylist if sub in s and s != sub]
   if substring_superstring_pair:
       substring_superstring_list.append(substring_superstring_pair)

flat_list = [item for sublist in substring_superstring_list for item in sublist]

Is there a more efficient way to do this? I'll eventually need to run this over a list containing 80k strings. I appreciate any suggestions or help.

2 Comments

You probably want to build a trie. Commented Sep 26, 2022 at 20:58
If you sort "mylist" first in ascending order (which is fast because of the C implementation), you can be sure that all superstrings that begin with a sub come after the sub in the list and before any entry that is either shorter than sub or whose first "len(sub)" characters aren't equal to sub. Commented Sep 26, 2022 at 21:06
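
The sorting idea in the comment above can be sketched as follows. This is a minimal illustration, not the commenter's code: the function name prefix_pairs is made up here, and it only finds superstrings that start with the substring. That happens to cover every pair in the sample list, but a substring in the middle or at the end of a longer string (for example 'b' in 'ab') would be missed.

```python
def prefix_pairs(strings):
    """Return (prefix, longer_string) pairs using one alphabetical sort.

    Hypothetical helper: it only detects prefix matches, not general substrings.
    """
    # Duplicates are dropped so a string is never paired with itself.
    ordered = sorted(set(strings))
    pairs = []
    for i, sub in enumerate(ordered):
        # In sorted order, every string starting with `sub` sits in one
        # contiguous run directly after `sub`, so stop at the first mismatch.
        for j in range(i + 1, len(ordered)):
            if not ordered[j].startswith(sub):
                break
            pairs.append((sub, ordered[j]))
    return pairs


mylist = ['abc123', 'abc123456', 'abc12355', 'def456', 'ghi789',
          'def4567', 'ghi78910', 'abc123cvz']
print(prefix_pairs(mylist))
```

Because the inner loop stops at the first non-matching entry, the work after sorting is proportional to the number of pairs found plus the number of strings, rather than to the square of the list length.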

3 Answers

2

Combining the suggestions in the comments with @ZabielskiGrabriel's answer, you can first sort the list by length and then compare each element of the sorted list with those that follow it, all in a list comprehension:

my_list = sorted(my_list, key=len)
[(x, y) for i, x in enumerate(my_list, 1) for y in my_list[i:] if x in y]
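
As a rough sketch, the two lines above can be wrapped in a function. The name substring_pairs and the de-duplication step are assumptions added here, not part of the answer: removing duplicates keeps a repeated string from being paired with itself, which the question's code rules out with s != sub.

```python
def substring_pairs(strings):
    """Return (substring, superstring) pairs after sorting by length.

    Hypothetical wrapper around the comprehension above.
    """
    # dict.fromkeys drops duplicates while keeping first-seen order.
    ordered = sorted(dict.fromkeys(strings), key=len)
    # A string can only be contained in one that is at least as long, so each
    # element is compared only with the elements that follow it.
    return [(x, y)
            for i, x in enumerate(ordered, 1)
            for y in ordered[i:]
            if x in y]


print(substring_pairs(['b', 'ab']))  # → [('b', 'ab')]
```

This still performs roughly n²/2 substring checks in the worst case, so it halves the work of the nested loops rather than changing the complexity.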

Benchmarks (with supplied test list):

%timeit op(my_list)
%timeit zabiel(my_list)
%timeit nin17(my_list)

Output:

3.92 µs ± 31 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.76 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.25 µs ± 7.75 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
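
The functions op, zabiel and nin17 are not shown in the answer. A plausible reconstruction, assuming they wrap the question's code, the two-loop answer and the sort-by-length comprehension respectively, with the standard timeit module standing in for IPython's %timeit:

```python
import timeit

my_list = ['abc123', 'abc123456', 'abc12355', 'def456', 'ghi789',
           'def4567', 'ghi78910', 'abc123cvz']


def op(strings):
    # Assumed to be the question's approach: one comprehension per string.
    nested = []
    for sub in strings:
        pairs = [(sub, s) for s in strings if sub in s and s != sub]
        if pairs:
            nested.append(pairs)
    return [item for sublist in nested for item in sublist]


def zabiel(strings):
    # Assumed to be the plain nested-loop answer.
    output = []
    for x in strings:
        for y in strings:
            if x in y and x != y:
                output.append((x, y))
    return output


def nin17(strings):
    # Assumed to be the sort-by-length comprehension from this answer.
    ordered = sorted(strings, key=len)
    return [(x, y) for i, x in enumerate(ordered, 1) for y in ordered[i:] if x in y]


RUNS = 10_000
for fn in (op, zabiel, nin17):
    seconds = timeit.timeit(lambda: fn(my_list), number=RUNS)
    print(f"{fn.__name__}: {seconds / RUNS * 1e6:.2f} microseconds per call")
```

Timings on an eight-element list mostly measure call overhead, so the ranking may look different on a list of realistic size.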

1 Comment

With alphabetical sorting this gives wrong results for my_list = ['b', 'ab']: the output is an empty list. The list must be sorted by length, not alphabetically: my_list = sorted(my_list, key=len)
0

Tomorrow I will try another method with map, reduce and filter. Also here you can find a nice tutorial about it:


my_list = ['abc123', 'abc123456', 'abc12355', 'def456', 'ghi789', 'def4567', 'ghi78910', 'abc123cvz']

output = []
for x in my_list:
    for y in my_list:
        if x in y and x != y:
            output.append((x, y))
print(output)

2 Comments

Btw, 80k items shouldn't be a problem for Python.
80k is on the edge of what Python can do in a reasonable time. It's 6.4 billion substring checks (80,000 × 80,000).
0

A much faster way is to use multiprocessing. The speedup depends on how many cores you have; on my 8-core PC it is 10-15 times faster. It's quite easy to do: just change the first for loop into a map and use multiprocessing.Pool:

    global find_sub2
    def find_sub2(sub):
        sub_pair = [(sub, s) for s in mylist if sub in s and s != sub]
        if sub_pair:
            return sub_pair
        else:
            return []
    pool = multiprocessing.Pool(processes=16)
    substring_superstring_list = pool.map(find_sub2, mylist)
    pool.close()
    flat_list = [item for sublist in substring_superstring_list for item in sublist]

I have compared the run times of several methods (based on your code and on the other answers) using a list of 20000 random strings of random length 10-200:

['original', '31.936 seconds']
['traditional_loops', '64.088 seconds']
['two_for_loops', '32.337 seconds']
['sorting', '17.713 seconds']
['with_map', '31.832 seconds']
['map_with_multiprocessing', '3.08 seconds']

Here is the code:

from tqdm import tqdm
import multiprocessing
import random
import time

ALLOWED_CHARS = 'abcdeghijklmn'
NUMBER_OF_STRINGS = 20000
MIN_STR_LENGTH = 10
MAX_STR_LENGTH = 100

def random_string_generator(str_size, allowed_chars=ALLOWED_CHARS):
    return ''.join(random.choice(allowed_chars) for _ in range(str_size))


print('Creating random strings')
mylist = [random_string_generator(random.randint(MIN_STR_LENGTH, MAX_STR_LENGTH)) for _ in tqdm(range(NUMBER_OF_STRINGS))]


def original():
    substring_superstring_list = []
    for sub in tqdm(mylist):
        sub_pair = [(sub, s) for s in mylist if sub in s and s != sub]
        if sub_pair:
            substring_superstring_list.append(sub_pair)
    return [item for sublist in substring_superstring_list for item in sublist]


def traditional_loops():
    output = []
    for i in tqdm(range(len(mylist))):
        for j in range(len(mylist)):
            if i != j and mylist[i] in mylist[j]:
                output.append((mylist[i], mylist[j]))
    return output


def two_for_loops():
    flat_list = []
    for x in tqdm(mylist):
        for y in mylist:
            if x in y and x != y:
                flat_list.append((x, y))
    return flat_list


def with_map():
    def find_sub(sub):
        sub_pair = [(sub, s) for s in mylist if sub in s and s != sub]
        if sub_pair:
            return sub_pair
        else:
            return []
    substring_superstring_list = map(find_sub, tqdm(mylist))
    return [item for sublist in substring_superstring_list for item in sublist]


def map_with_multiprocessing():
    global find_sub2
    def find_sub2(sub):
        sub_pair = [(sub, s) for s in mylist if sub in s and s != sub]
        if sub_pair:
            return sub_pair
        else:
            return []
    pool = multiprocessing.Pool(processes=16)
    substring_superstring_list = pool.map(find_sub2, tqdm(mylist))
    pool.close()
    return [item for sublist in substring_superstring_list for item in sublist]


def sorting():
    # Sort by length, not alphabetically, so every superstring comes after its substrings.
    my_list = sorted(mylist, key=len)
    return [(x, y) for i, x in enumerate(tqdm(my_list), 1) for y in my_list[i:] if x in y]


methods = [original, traditional_loops, two_for_loops, sorting, with_map, map_with_multiprocessing]
results = []
for fun in methods:
    print()
    print(f'Start testing {fun.__name__}')
    start = time.time()
    flat_list = fun()
    #print(flat_list)
    end = time.time()
    result = [fun.__name__, f'{int(1000 * (end - start)) / 1000.} seconds', flat_list]
    results.append(result)

solution = (set(results[0][2]), len(results[0][2]))
print()
for i in results:
    print(f'{i[:2]} Solution is correct? {solution == (set(i[2]), len(i[2]))}')
