38

For example, I have the string:

 aacbbbqq

As a result, I want to get the following matches:

 (aa, c, bbb, qq)  

I know that I can write something like this:

 ([a]+)|([b]+)|([c]+)|...  

But I think it's ugly, and I'm looking for a better solution. I want a regular expression solution, not a self-written finite-state machine.

8 Answers

47

You can match that with: (\w)\1*

26

itertools.groupby is not a regexp, but it's not self-written either. :-) A quote from the Python docs:

# [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
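Applied to the string from the question, a minimal sketch could look like this (the names text and runs are only illustrative); joining each group gives strings instead of lists of characters:

```python
from itertools import groupby

text = 'aacbbbqq'

# groupby yields a (key, group-iterator) pair for each run of equal items;
# joining a group turns it back into a string.
runs = [''.join(group) for _, group in groupby(text)]
print(runs)  # ['aa', 'c', 'bbb', 'qq']
```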

1 Comment

@Kobi aaaa bbb aaa, as expected. By the way, it returns a list of lists, but that shouldn't be a problem. :-)

26

Generally

The trick is to match a single char of the range you want, and then make sure you match all repetitions of the same character:

>>> matcher= re.compile(r'(.)\1*')

This matches any single character (.) and then its repetitions (\1*) if any.

For your input string, you can get the desired output as:

>>> [match.group() for match in matcher.finditer('aacbbbqq')]
['aa', 'c', 'bbb', 'qq']

NB: because the pattern contains a capture group, re.findall returns only the captured character of each match, not the whole run, so it won't work correctly here.
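To make the findall caveat concrete, here is a small sketch. The outer-group workaround is just one possible option, and the variable names are made up for illustration:

```python
import re

text = 'aacbbbqq'

# With a single capture group, findall returns only what the group captured.
only_first_chars = re.findall(r'(.)\1*', text)
print(only_first_chars)  # ['a', 'c', 'b', 'q']

# Possible workaround: wrap the whole run in an outer group, so the
# back-reference becomes \2; findall then returns (run, char) tuples.
runs = [run for run, _ in re.findall(r'((.)\2*)', text)]
print(runs)  # ['aa', 'c', 'bbb', 'qq']
```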

Other ranges

If you don't want to match just any character, change the . in the regular expression accordingly:

>>> matcher= re.compile(r'([a-z])\1*') # only lower case ASCII letters
>>> matcher= re.compile(r'(?i)([a-z])\1*') # only ASCII letters
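>>> matcher= re.compile(r'(\w)\1*') # ASCII letters, digits or underscores in Python 2; Unicode-aware by default for str patterns in Python 3
>>> matcher= re.compile(r'(?u)(\w)\1*') # against unicode values, any letter or digit known to Unicode, or underscore ((?u) is redundant for str patterns in Python 3)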

Check the latter against u'hello²²' (Python 2.x) or 'hello²²' (Python 3.x):

>>> text= u'hello=\xb2\xb2'
>>> print('\n'.join(match.group() for match in matcher.finditer(text)))
h
e
ll
o
²²

With the re.LOCALE flag, what \w matches against non-Unicode strings / bytearrays depends on the current locale, so it can change if you have first issued a locale.setlocale call.
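In Python 3, str patterns are Unicode-aware by default, and the re.ASCII flag restricts \w to [a-zA-Z0-9_]. A short sketch of the difference, reusing the test string from above (the variable names are illustrative):

```python
import re

text = 'hello=\xb2\xb2'

# Default for str patterns in Python 3: \w also matches '²'.
unicode_runs = [m.group() for m in re.finditer(r'(\w)\1*', text)]
# With re.ASCII, '²' is no longer a word character and is skipped.
ascii_runs = [m.group() for m in re.finditer(r'(\w)\1*', text, re.ASCII)]

print(unicode_runs)  # ['h', 'e', 'll', 'o', '²²']
print(ascii_runs)    # ['h', 'e', 'll', 'o']
```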

8

This will work, see a working example here: http://www.rubular.com/r/ptdPuz0qDV

(\w)\1*

1 Comment

:-) I was trying it in Rubular to show a working example and got here a little late.

5

The findall method will work if you capture the back-reference like so:

result = [match[1] + match[0] for match in re.findall(r"(.)(\1*)", string)]
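A self-contained sketch of the same idea, showing the intermediate tuples that findall returns when the pattern has two groups (the sample string is taken from the question; pairs is an illustrative name):

```python
import re

string = 'aacbbbqq'

# Each match becomes a (first character, remaining repetitions) tuple.
pairs = re.findall(r'(.)(\1*)', string)
print(pairs)  # [('a', 'a'), ('c', ''), ('b', 'bb'), ('q', 'q')]

# Concatenating the two parts restores the full run.
result = [first + rest for first, rest in pairs]
print(result)  # ['aa', 'c', 'bbb', 'qq']
```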

4

You can use:

re.sub(r"(\w)\1*", r'\1', 'tessst')

The output would be:

'test'

2

You can try something like this:

import re

string = 'aacbbbqq'
result = re.findall(r'((\w)\2*)', string)  # greedy \2*; a lazy \2*? would match zero repetitions and split every run into single characters
output = [x[0] for x in result]

print(output)

The output will be:

['aa', 'c', 'bbb', 'qq']

0

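This raw solution may be useful. Note that it only collects runs of two or more characters, and that the last run is only flushed when a different character follows it (hence the trailing space in the sample string).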

    string = "helllllo worlddd hhiii "
    i = 0
    j = 1
    b = ''
    l = []
    for a in range(len(string)-1):
        if string[i] !=  string[j]:
            j = j+1
            i = j-1
            if b:
                l.append(b)
                b = ''
        elif string[i] == string[j]:
            if j-i == 1:
                b += string[i:j+1]
            else:
                b += string[i]
            j = j+1
    print(l)

output:

['lllll', 'ddd', 'hh', 'iii'] 
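For comparison, the same result (only runs of two or more characters) can be sketched with a regular expression by requiring at least one repetition with \1+ instead of \1*. This is an alternative to the loop, not a rewrite of it:

```python
import re

string = "helllllo worlddd hhiii "

# (.)\1+ matches a character followed by one or more copies of itself,
# so single characters are skipped.
repeated = [m.group() for m in re.finditer(r'(.)\1+', string)]
print(repeated)  # ['lllll', 'ddd', 'hh', 'iii']
```

Unlike the loop, this does not need the trailing space to pick up the last run.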
