RegExp match repeated characters

Question

For example, I have the string:

 aacbbbqq

As a result, I want the following matches:

 (aa, c, bbb, qq)

I know that I can write something like this:

 ([a]+)|([b]+)|([c]+)|...

But I think it's ugly, and I'm looking for a better solution. I'm looking for a regular expression solution, not a self-written finite-state machine.

Qtax · Accepted Answer · 2011-06-10 12:05:37Z

47

You can match that with: (\w)\1*


DrTyrsa · Accepted Answer · 2011-06-10 12:07:18Z

26

itertools.groupby is not a RegExp, but it's not self-written either. :-) A quote from the Python docs:

# [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
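The docs excerpt above yields lists of characters; a minimal sketch of getting strings back for the question's input follows. The helper name `runs` is made up for this illustration and is not part of itertools.

```python
from itertools import groupby


def runs(text):
    """Split text into runs of identical consecutive characters."""
    # groupby groups *adjacent* equal items, so a character that
    # reappears later in the string starts a new run.
    return [''.join(group) for _key, group in groupby(text)]


print(runs('aacbbbqq'))    # ['aa', 'c', 'bbb', 'qq']
print(runs('aaaabbbaaa'))  # ['aaaa', 'bbb', 'aaa']
```

Joining each group is only needed if strings are wanted; `list(group)` gives the list-of-lists form.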


1 Comment

DrTyrsa Over a year ago

@Kobi aaaa bbb aaa, as expected. By the way, it returns a list of lists, but that shouldn't be a problem. :-)

Community · Accepted Answer · 2020-06-20 09:12:55Z

Generally

The trick is to match a single char of the range you want, and then make sure you match all repetitions of the same character:

>>> matcher= re.compile(r'(.)\1*')

This matches any single character (.) and then its repetitions (\1*) if any.

For your input string, you can get the desired output as:

>>> [match.group() for match in matcher.finditer('aacbbbqq')]
['aa', 'c', 'bbb', 'qq']

NB: because the pattern contains a capturing group, re.findall returns only the contents of that group (the first character of each run), not the whole match.
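To make the findall caveat concrete, here is a small sketch comparing the two calls on the question's input. The variable names are illustrative only.

```python
import re

matcher = re.compile(r'(.)\1*')
text = 'aacbbbqq'

# With exactly one capturing group, findall returns that group's text,
# which here is just the first character of each run.
first_chars = matcher.findall(text)

# finditer yields match objects, so group() returns the whole run.
whole_runs = [m.group() for m in matcher.finditer(text)]

print(first_chars)  # ['a', 'c', 'b', 'q']
print(whole_runs)   # ['aa', 'c', 'bbb', 'qq']
```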

Other ranges

If you don't want to match just any character, change the . in the regular expression accordingly:

>>> matcher = re.compile(r'([a-z])\1*') # only lower case ASCII letters
>>> matcher = re.compile(r'(?i)([a-z])\1*') # only ASCII letters
>>> matcher = re.compile(r'(\w)\1*') # letters, digits or underscores (ASCII only in Python 2; Unicode-aware for str patterns in Python 3)
>>> matcher = re.compile(r'(?u)(\w)\1*') # against unicode values, any letter or digit known to Unicode, or underscore

Check the latter against u'hello=\xb2\xb2', i.e. 'hello=²²' (the u prefix is needed in Python 2.x and optional in Python 3.x):

>>> text= u'hello=\xb2\xb2'
>>> print('\n'.join(match.group() for match in matcher.finditer(text)))
h
e
ll
o
²²

For non-Unicode strings / byte strings, what \w matches can change if you have first issued a locale.setlocale call and compile the pattern with the re.LOCALE flag.
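As a rough illustration for Python 3 (an assumption; the session above may have been written for Python 2), \w is already Unicode-aware for str patterns, and re.ASCII restricts it to [a-zA-Z0-9_]. The variable names below are illustrative.

```python
import re

text = 'hello=\xb2\xb2'  # 'hello=²²'

# Default for str patterns in Python 3: \w is Unicode-aware,
# so the superscript twos count as word characters.
unicode_runs = [m.group() for m in re.finditer(r'(\w)\1*', text)]

# re.ASCII limits \w to [a-zA-Z0-9_], so '²' is skipped like '='.
ascii_runs = [m.group() for m in re.finditer(r'(\w)\1*', text, re.ASCII)]

print(unicode_runs)  # ['h', 'e', 'll', 'o', '²²']
print(ascii_runs)    # ['h', 'e', 'll', 'o']
```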

Rakesh Sankar · Accepted Answer · 2011-06-10 12:08:50Z

8

This will work, see a working example here: http://www.rubular.com/r/ptdPuz0qDV

(\w)\1*


1 Comment

Rakesh Sankar Over a year ago

:-) I was trying it in Rubular to show a working example, so I got here a little late.

SwiftsNamesake · Accepted Answer · 2013-12-01 22:33:49Z

5

The findall method will work if you capture the back-reference like so:

result = [match[1] + match[0] for match in re.findall(r"(.)(\1*)", string)]
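A self-contained sketch of the same idea, with the question's input filled in for the undefined `string` variable. With two capturing groups, findall returns a tuple per match: the first character and the remaining repetitions.

```python
import re

text = 'aacbbbqq'

# Group 1 is the first character of a run; group 2 is the rest of the
# run (possibly empty when the character is not repeated).
pairs = re.findall(r'(.)(\1*)', text)

# Concatenating both groups rebuilds the whole run.
runs = [first + rest for first, rest in pairs]

print(pairs)  # [('a', 'a'), ('c', ''), ('b', 'bb'), ('q', 'q')]
print(runs)   # ['aa', 'c', 'bbb', 'qq']
```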

edited Dec 1, 2013 at 22:33

answered Dec 1, 2013 at 5:06


Wesam Nabki · Accepted Answer · 2017-12-12 12:04:00Z

4

You can use:

re.sub(r"(\w)\1*", r'\1', 'tessst')

The output would be:

'test'


Kate Ishchenko · Accepted Answer · 2023-04-24 20:09:03Z

2

You can try something like this:

import re

string = 'aacbbbqq'
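# The quantifier must be greedy: a trailing lazy \2*? would match zero
# repetitions and return every character on its own.
result = re.findall(r'((\w)\2*)', string)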
output = [x[0] for x in result]

print(output)

The output will be:

['aa', 'c', 'bbb', 'qq']
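The choice of quantifier on the backreference matters for this pattern. The sketch below contrasts a lazy `\2*?` with a greedy `\2*`; since nothing follows the quantifier, the lazy form is satisfied with zero repetitions. The variable names are illustrative.

```python
import re

text = 'aacbbbqq'

# Lazy: \2*? takes as few repetitions as possible, which is zero here,
# so every character becomes its own match.
lazy = [whole for whole, _char in re.findall(r'((\w)\2*?)', text)]

# Greedy: \2* consumes the whole run of the captured character.
greedy = [whole for whole, _char in re.findall(r'((\w)\2*)', text)]

print(lazy)    # ['a', 'a', 'c', 'b', 'b', 'b', 'q', 'q']
print(greedy)  # ['aa', 'c', 'bbb', 'qq']
```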


Aditya Rama Narayana Vuyyuru · Accepted Answer · 2024-04-26 19:03:17Z

0

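This raw (non-regex) solution may be useful. Note that it only collects runs of two or more identical characters, and that it relies on the trailing space in the input to flush the last run.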

    string = "helllllo worlddd hhiii "
    i = 0
    j = 1
    b = ''
    l = []
    for a in range(len(string)-1):
        if string[i] !=  string[j]:
            j = j+1
            i = j-1
            if b:
                l.append(b)
                b = ''
        elif string[i] == string[j]:
            if j-i == 1:
                b += string[i:j+1]
            else:
                b += string[i]
            j = j+1
    print(l)

output:

['lllll', 'ddd', 'hh', 'iii']
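For comparison, the same result (only runs of two or more identical characters) can be sketched with a regular expression by requiring at least one repetition with `\1+` instead of `\1*`. This is a swapped-in alternative, not the loop above rewritten.

```python
import re

text = "helllllo worlddd hhiii "

# (.)\1+ needs the captured character to repeat at least once,
# so single characters are skipped entirely.
repeated = [m.group() for m in re.finditer(r'(.)\1+', text)]

print(repeated)  # ['lllll', 'ddd', 'hh', 'iii']
```

Unlike the loop, this does not depend on the input ending with a sentinel character.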

edited Apr 26, 2024 at 19:03

answered Apr 26, 2024 at 18:51
