How to find contiguous substrings from a string in python

Question

I have a string abccddde

I need to find substrings like: a, b, c, cc, d, dd, ddd, e

substrings ab or cd are not valid.

I tried finding all the substrings from a string but its not efficient

def get_all_substrings(input_string):
    length = len(input_string)
    return [input_string[i:j+1] for i in range(length) for j in range(i,length)]

This is outputting:

['a', 'ab', 'abc', 'abcc', 'abccd', 'abccdd', 'abccddd', 'abccddde', 'b', 'bc', 'bcc', 'bccd', 'bccdd', 'bccddd', 'bccddde', 'c', 'cc', 'ccd', 'ccdd', 'ccddd', 'ccddde', 'c', 'cd', 'cdd', 'cddd', 'cddde', 'd', 'dd', 'ddd', 'ddde', 'd', 'dd', 'dde', 'd', 'de', 'e']

This was the method i followed to find the substrings but it gives all the possiblities but that is what makes it inefficient Please Help!

Are the same symbols in the source string always grouped together? If yes, the solution with collections.Counter is ok. If no, it isn't. — Ellioh
– Ellioh, Commented Jan 28, 2017 at 6:27
if i take a string 'ccdc' its giving me output as 'c', 'cc', 'ccc', 'd' which is wrong. It should not print 'ccc' as its not a contiguous substring. So the collections.Counter solution is not completely correct. — user7481832
– user7481832, Commented Jan 28, 2017 at 6:29

Mohammad Yusuf · Accepted Answer · 2021-05-17 05:47:44Z

4

You can use itertools.groupby() for this:

from itertools import groupby

s = 'abccdddcce'
l1 = ["".join(g) for k, g in groupby(s)]
l2 = [a[:i+1] for a in l1 for i in range(len(a))]
print l2

Output:

['a', 'b', 'c', 'cc', 'd', 'dd', 'ddd', 'c', 'cc', 'e']

For larger input data, replace the Lists with Generators,

l1=()
l2=()

edited May 17, 2021 at 5:47

answered Jan 28, 2017 at 6:36

Mohammad Yusuf

17.1k12 gold badges60 silver badges88 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

user7481832 Over a year ago

this answer is really cool. But is it efficient enough for larger inputs ? I tried with larger inputs and its taking a bit time

Mohammad Yusuf Over a year ago

@Sagar What you can do is make it a generator instead of list l1=(..); l2=(..) and then you can loop over l2 to get the strings.

user7481832 Over a year ago

This solution is a silverbullet for my problem.thanks a lot mate.

Mariy · Accepted Answer · 2017-01-28 06:57:21Z

1

itertools.groupby can tell you the number of consecutive chars. After that for each group you have the char repeated upto that number.

from itertools import groupby

def substrings(s):
    for char, group in groupby(s):
        substr = ''
        for i in group:
            substr += i
            yield substr

for result in substrings('abccdddcce'):
    print(result)

edited Jan 28, 2017 at 6:57

answered Jan 28, 2017 at 6:45

Mariy

5,9644 gold badges44 silver badges60 bronze badges

2 Comments

Dan D. Over a year ago

You can leave out the key function in groupby(s, lambda x: x) as it defaults to identity: groupby(s).

user7481832 Over a year ago

thanks a lot for the solution.even this worked for me.

Kasravnd · Accepted Answer · 2017-01-28 06:20:26Z

0

here is one way using regex:

In [85]: [j for i in re.findall(r'((\w)(\2+)?)', s) for j in set(i) if j]
Out[85]: ['a', 'b', 'c', 'cc', 'ddd', 'dd', 'd', 'e']

edited Jan 28, 2017 at 6:20

answered Jan 28, 2017 at 6:11

Kasravnd

108k19 gold badges167 silver badges195 bronze badges

1 Comment

user7481832 Over a year ago

this also partially worked for me. anyway thanks for the solution. :)

abcabc · Accepted Answer · 2017-01-28 07:00:31Z

0

from itertools import groupby

def runlength_compress(src):
    return ((k, sum(1 for _ in g)) for k,g in groupby(src))

def contiguous_substrings(src):
    return [c*(i+1) for c, count in runlength_compress(src) for i in range(count)]

print(contiguous_substrings('abccddde'))

answered Jan 28, 2017 at 7:00

abcabc

613 bronze badges

Comments

Steven Summers · Accepted Answer · 2017-01-28 07:12:07Z

The following will do what you want. I don't know if its efficient compared to other solutions though.

def get_all_substrings(text):
    res = []
    prev = ''
    s = ''

    for c in text:
        if c == prev:
            s += c
        else:
            s = prev = c
        res.append(s)

    return res

# Output
>>> get_all_substrings('abccddde')
['a', 'b', 'c', 'cc', 'd', 'dd', 'ddd', 'e']
>>> get_all_substrings('abccdddec')
['a', 'b', 'c', 'cc', 'd', 'dd', 'ddd', 'e', 'c']

Timings

import timeit
import random

size = 100
values = 'abcde'
s = ''.join(random.choice(values) for _ in range(size))

print(s)

print(timeit.timeit("get_all_substrings(s)",
                    setup = 'from __main__ import s, get_all_substrings',
                    number = 10000) )

# Example for size 100 input
abbaaebacddbdedbdbbacadcdddabaeabacdcbeebbccaadebdcecadcecceececcacebacecbbccdedddddabaeeceeeccabdcc
0.16761969871275967

Collectives™ on Stack Overflow

How to find contiguous substrings from a string in python

5 Answers 5

3 Comments

2 Comments

1 Comment

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

3 Comments

2 Comments

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related