I'm just learning Python and am having trouble working out the regex pattern for the following string:

"...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."

I'm trying to extract the data between the begin: and :end for n iterations without getting duplicate data. I've attached my current attempt.

    for m in re.finditer('.begin:(.*),(.*):(.*):(.*:.*):end.', list_to_string(j), re.DOTALL):
        print(m.group(1))
        print(m.group(2))
        print(m.group(3))
        print(m.group(4))

the output is:

begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33
13
2
2006-11-31 T 11:46

and I want it to be:

32
12
1
2005-10-30 T 10:45
33
13
2
2006-11-31 T 11:46

Thank you for any help.

3 Answers

2

.* is greedy, matching across your intended :end boundary. Replace all .*s with lazy .*?.

>>> import re
>>> s = """...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."""
>>> re.findall("begin:(.*?),(.*?):(.*?):(.*?:.*?):end", s)
[('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46'), 
 ('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46')]

With a modified pattern, forcing single quotes to be present at the start/end of the match:

>>> re.findall("'begin:(.*?),(.*?):(.*?):(.*?:.*?):end'", s)
[('32', '12', '1', '2005-10-30 T 10:45'), ('33', '13', '2', '2006-11-31 T 11:46')]
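
If the surrounding quotes can't be relied on, another way to avoid the duplicates is to keep the plain lazy pattern and skip records that have already been seen. This is only a minimal sketch: the names `pattern`, `seen` and `records` are my own, and it assumes that two identical records really are duplicates rather than legitimate repeats.

```python
import re

s = """...', 'begin:32,12:1:2005-10-30 T 10:45:end', 'begin:33,13:2:2006-11-31 T 11:46:end', '... <div dir="ltr">begin:32,12:1:2005-10-30 T 10:45:end<br>begin:33,13:2:2006-11-31 T 11:46:end<br>..."""

# Lazy quantifiers: each .*? stops at the first delimiter that lets the match succeed.
pattern = re.compile(r"begin:(.*?),(.*?):(.*?):(.*?:.*?):end")

seen = set()    # records encountered so far
records = []    # unique records, in order of first appearance
for m in pattern.finditer(s):
    rec = m.groups()
    if rec not in seen:
        seen.add(rec)
        records.append(rec)

# Print one field per line, as in the question's desired output.
for rec in records:
    for field in rec:
        print(field)
```

A set is used for the membership check so that the original order of the records is kept in the separate list.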
5 Comments

@darls: Read about quantifiers. A greedy quantifier matches as much as possible, a lazy quantifier matches as little as possible.
I got it to work with the ".*?" and >begin: ... :end<. How could I modify the pattern to identify the iteration beginning and ending with the ' character?
@darls: Where's the problem? "'begin:(.*?),(.*?):(.*?):(.*?:.*?):end'", or am I understanding something wrong?
It doesn't match anything using that pattern, possibly because the ' is used to delimit the pattern(?). Neither ''...'' nor '\'...\'' works. This is more for personal interest at this point.
I've copied your example string, and the regex matches perfectly (I did have to include it in triple quotes because your example string contains both single and double quotes).
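
On the quoting question raised in the comments, here is a small sketch of how a pattern containing literal single quotes can be written in Python. The shortened test string `s` and the names `p1`, `p2`, `p3` are made up for illustration. All three spellings produce the same pattern string, so the quoting style by itself is unlikely to be why nothing matched. Printing `repr()` of the actual input is one way to check whether it really contains the quotes where the pattern expects them.

```python
import re

# Shortened, hypothetical input that contains the quoted form.
s = "...', 'begin:32,12:1:2005-10-30 T 10:45:end', '..."

p1 = "'begin:(.*?):end'"        # double-quoted literal, no escaping needed
p2 = '\'begin:(.*?):end\''      # single-quoted literal with escaped quotes
p3 = r"""'begin:(.*?):end'"""   # raw triple-quoted literal

# The three literals are the same string, hence the same regex.
same = (p1 == p2 == p3)
print(same)

result = re.findall(p1, s)
print(result)
print(repr(s))  # shows exactly which characters the input contains
```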
0

You need to make the variable-sized parts of your pattern "non-greedy". That is, make them match the smallest possible string rather than the longest possible (which is the default).

Try the pattern '.begin:(.*?),(.*?):(.*?):(.*?:.*?):end.'.

0

An alternative to Blckknght's and Tim Pietzcker's answers is:

re.findall("begin:([^,]*),([^:]*):([^:]*):([^:]*:[^:]*):end", s)

Instead of choosing non-greedy extensions, you use [^X] to mean "any character but X" for some X.

The advantage is that it's more rigid: there's no way to get the delimiter in the result, so

'begin:33,13:134:2:2006-11-31 T 11:46:end'

would not match, whereas it would for Blckknght and Tim Pietzcker's. For this reason, it's also probably faster on edge cases. This is probably unimportant in real-world circumstances.
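
The difference can be checked with a short sketch. The names `lazy`, `strict`, `good` and `bad` are mine, and `bad` is the malformed record from above with one field too many.

```python
import re

lazy = re.compile(r"begin:(.*?),(.*?):(.*?):(.*?:.*?):end")
strict = re.compile(r"begin:([^,]*),([^:]*):([^:]*):([^:]*:[^:]*):end")

good = "begin:33,13:2:2006-11-31 T 11:46:end"
bad = "begin:33,13:134:2:2006-11-31 T 11:46:end"  # one field too many

# Both patterns agree on well-formed input.
print(lazy.findall(good))
print(strict.findall(good))

# On malformed input the lazy pattern still matches by letting the last
# group absorb an extra colon; the negated classes cannot cross a colon,
# so the strict pattern finds nothing.
lazy_bad = lazy.findall(bad)
strict_bad = strict.findall(bad)
print(lazy_bad)
print(strict_bad)
```

Whether silently accepting the malformed record or rejecting it is preferable depends on how much the input can be trusted.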

The disadvantage is that it's more rigid, of course.

I suggest choosing whichever one makes more intuitive sense, because both methods work.
