Python: help composing regex pattern

Question

I'm just learning Python and am having trouble working out the regex pattern for the following string:

"...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."

I'm trying to extract the data between begin: and :end for every occurrence, without getting duplicate data. My current attempt:

    for m in re.finditer('.begin:(.*),(.*):(.*):(.*:.*):end.', list_to_string(j), re.DOTALL):
        print(m.group(1))
        print(m.group(2))
        print(m.group(3))
        print(m.group(4))

the output is:

begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33
13
2
2006-11-31 T 11:46

and I want it to be:

32
12
1
2005-10-30 T 10:45
33
13
2
2006-11-31 T 11:46

Thank you for any help.

Tim Pietzcker · Accepted Answer · 2013-09-22 21:26:56Z

2

.* is greedy, so it matches across your intended :end boundary. Replace every .* with the lazy .*?.

>>> import re
>>> s = """...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."""
>>> re.findall("begin:(.*?),(.*?):(.*?):(.*?:.*?):end", s)
[('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46'),
 ('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46')]

A modified pattern that requires single quotes at the start and end of each match avoids the duplicates:

>>> re.findall("'begin:(.*?),(.*?):(.*?):(.*?:.*?):end'", s)
[('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46')]
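If the quotes can't be relied on, another way to avoid the duplicates is to drop repeated records after matching. This is only a sketch, not part of the original answer: it assumes duplicates are exact copies of an earlier record, and the names `pattern`, `seen` and `records` are made up for illustration. It prints the fields one per line, in the format asked for in the question.

```python
import re

s = """...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."""

# Same lazy pattern as above, without the surrounding quotes.
pattern = re.compile(r"begin:(.*?),(.*?):(.*?):(.*?:.*?):end")

seen = set()    # records already collected (hypothetical helper name)
records = []    # unique records, in first-seen order
for m in pattern.finditer(s):
    fields = m.groups()
    if fields in seen:
        continue  # exact duplicate of an earlier record, skip it
    seen.add(fields)
    records.append(fields)

# Print every field of every unique record on its own line.
for fields in records:
    for value in fields:
        print(value)
```

Keeping a list next to the set preserves the order in which the records first appear, which a plain set would not guarantee.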

edited Sep 22, 2013 at 21:26

answered Sep 22, 2013 at 20:44

Tim Pietzcker


5 Comments

Tim Pietzcker Over a year ago

@darls: Read about quantifiers. A greedy quantifier matches as much as possible, a lazy quantifier matches as little as possible.

darls Over a year ago

I got it to work with the ".*?" and >begin: ... :end<. How could I modify the pattern to identify the iteration beginning and ending with the ' character?

Tim Pietzcker Over a year ago

@darls: Where's the problem? "'begin:(.*?),(.*?):(.*?):(.*?:.*?):end'", or am I understanding something wrong?

darls Over a year ago

It doesn't match anything using that pattern, possibly because the ' is used to define the pattern(?). ''...'' doesn't work, and neither does '\'...\''. This is more for personal interest at this point.

Tim Pietzcker Over a year ago

I've copied your example string, and the regex matches perfectly (I did have to include it in triple quotes because your example string contains both single and double quotes).

Blckknght · Accepted Answer · 2013-09-22 20:47:05Z

0

You need to make the variable-sized parts of your pattern "non-greedy". That is, make them match the smallest possible string rather than the longest possible (which is the default).

Try the pattern '.begin:(.*?),(.*?):(.*?):(.*?:.*?):end.'.
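To see the difference on a small scale, here is a minimal sketch using a made-up sample string (`text` is not from the question) that contains two records back to back:

```python
import re

# Hypothetical short sample with two begin/end records.
text = "begin:1,2:end begin:3,4:end"

# Greedy: (.*) takes as much as it can, so the match runs to the LAST ":end".
greedy = re.findall(r"begin:(.*):end", text)

# Lazy: (.*?) takes as little as it can, so each match stops at the FIRST ":end".
lazy = re.findall(r"begin:(.*?):end", text)

print(greedy)  # ['1,2:end begin:3,4']
print(lazy)    # ['1,2', '3,4']
```

The greedy version swallows both records into a single group, which is the same effect as the first group in the question's output.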

answered Sep 22, 2013 at 20:47

Blckknght


Veedrac · Accepted Answer · 2013-09-22 21:10:07Z

0

An alternative to Blckknght's and Tim Pietzcker's answers is

re.findall("begin:([^,]*),([^:]*):([^:]*):([^:]*:[^:]*):end", s)

Instead of non-greedy quantifiers, this uses the negated character class [^X], meaning "any character except X" for some X.

The advantage is that it's more rigid: there's no way to get the delimiter in the result, so

'begin:33,13:134:2:2006-11-31 T 11:46:end'

would not match, whereas it would match with Blckknght's and Tim Pietzcker's patterns. For the same reason, it's probably faster on edge cases, though this is probably unimportant in real-world circumstances.
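A quick way to check this, as a sketch (the variable names `lazy_pattern`, `strict_pattern` and `malformed` are just for illustration):

```python
import re

# Lazy pattern from the other answers, and the negated-class pattern from this one.
lazy_pattern = re.compile(r"begin:(.*?),(.*?):(.*?):(.*?:.*?):end")
strict_pattern = re.compile(r"begin:([^,]*),([^:]*):([^:]*):([^:]*:[^:]*):end")

# The record from above, with one extra ":134" field.
malformed = "'begin:33,13:134:2:2006-11-31 T 11:46:end'"

lazy_result = lazy_pattern.findall(malformed)
strict_result = strict_pattern.findall(malformed)

print(lazy_result)    # [('33', '13', '134', '2:2006-11-31 T 11:46')]
print(strict_result)  # []
```

The lazy pattern still matches, but the last group stretches over an extra colon to reach :end, so a delimiter ends up inside the captured data. The negated-class pattern rejects the record outright.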

The disadvantage is that it's more rigid, of course.

I suggest choosing whichever one makes more intuitive sense, because both methods work.

answered Sep 22, 2013 at 21:10

Veedrac
