How to write a regex for a text including comma separated values in python?

Question

I'm trying to write a regex in Python to get the F1 to F8 fields from a line that looks like this:

LineNumber(digits): F1, F2, F3, ..., F8;

F1 to F8 can have lowercase/uppercase letters and hyphens.

For example:

Header
Description
21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;
Footer

What I've tried so far is matched = re.match(r'\d+: ([a-zA-Z-]*, ){7}(.*);', line), which matches lines in the above format. However, when I call matched.groups() to print the matched fields, I only get F7 (with its trailing comma and space) and F8, while the expected output is a list containing F1 through F8.

I have a few questions regarding this regex:

1. I guess the groups() method returns the fields that were grouped in the regex using (...). Why don't I get F1 to F6 in the output, even though they are grouped using (...) and have matched the regex?
2. What is a better regex I can write that excludes the comma from F1 to F7? (A short explanation of the suggested regex would be much appreciated.)

@PadraicCunningham There are other lines in the document that don't match this pattern. Furthermore, the line numbers are printed at the beginning of each line. Would parsing as a CSV still work?
– Matt, Commented Sep 7, 2016 at 23:11
Check for lines that don't match, and remove the line number with line.split(': ')[1].
– TigerhawkT3, Commented Sep 7, 2016 at 23:11
@PadraicCunningham I wrote an example in the problem description.
– Matt, Commented Sep 7, 2016 at 23:15

Joran Beasley · Accepted Answer · 2016-09-07 23:36:27Z

1

>>> import re
>>> pat = re.compile(r"""\s+    # one or more whitespace characters
                         (.*?)  # the shortest possible run of anything (captured)
                         \s*    # zero or more whitespace characters
                         [;,]   # a semicolon or a comma
                      """, re.X)
>>> pat.findall("LineNumber(digits): F1, F2, F3, F4, F5, F6, F7, F8;")
['F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8']

edited Sep 7, 2016 at 23:36

answered Sep 7, 2016 at 23:32

Joran Beasley


3 Comments

Matt Over a year ago

Thanks Joran. This solution works well. Can you briefly explain how this works?

Joran Beasley Over a year ago

there you go :)

Matt Over a year ago

Thanks a lot. The only problem with this solution is that it also matches some text in the header that is comma separated but doesn't follow the pattern.

Steve Shipway · Accepted Answer · 2016-09-07 23:28:55Z

0

When you have a construct like (pattern){number} then although it matches multiple instances, only the last one will be stored. In other words, you get one bucket per (), even if you parse it multiple times, in which case the last instance is the one kept. Note that you will get a bucket for ALL bracket pairs, even if they are not used, as in something like (a(b)?c)?d matching d.
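To see this behaviour directly, here is a small sketch using the sample line from the question (the variable names are just for illustration):

```python
import re

line = "21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;"

# The repeated group matches seven times, but only its final
# repetition (F7 plus the trailing ", ") is kept.
matched = re.match(r'\d+: ([a-zA-Z-]*, ){7}(.*);', line)
print(matched.groups())  # ('YES, ', 'NO')

# Every bracket pair gets a slot, even when it takes no part in the
# match; Python reports such groups as None.
unused = re.match(r'(a(b)?c)?d', 'd')
print(unused.groups())  # (None, None)
```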

If you know how many items to expect, then you can do your regexp the long way:

\d+: *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *;

This way, since you have 8 sets of brackets, you get 8 items in the tuple returned by matched.groups(). Also, the spaces and commas between the fields sit outside the brackets, so they are not captured.
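If typing the same sub-pattern eight times is the main objection, one option is to let Python assemble the long regex. This is only a sketch: FIELD, FIELD_COUNT and pattern are names made up for this example, and the resulting regex is the same as the one written out above.

```python
import re

FIELD = r'([a-zA-Z-]+)'   # one captured field: letters and hyphens
FIELD_COUNT = 8           # assumed fixed, as stated in the question

# Join eight copies of the field pattern with " *, *" between them,
# which yields the same expression as the hand-written version.
pattern = re.compile(r'\d+: *' + r' *, *'.join([FIELD] * FIELD_COUNT) + r' *;')

line = "21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;"
matched = pattern.match(line)
fields = list(matched.groups()) if matched else []
print(fields)  # ['Yes', 'No', 'Yes', 'No', 'Ye-s', 'N-o', 'YES', 'NO']
```

Lines such as "Header" or "Footer" simply fail to match, so pattern.match returns None for them.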

Given that your string is a CSV, you may be better off parsing it differently and splitting on commas rather than trying to have a single regexp to match the whole line.
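A rough sketch of that split-based approach follows. It assumes every data line starts with digits and a colon and ends with a semicolon, as in the question; parse_fields, LINE_RE and FIELD_RE are hypothetical names introduced here, not part of any library.

```python
import re

LINE_RE = re.compile(r'\d+: *(.*);$')   # line number, colon, body, semicolon
FIELD_RE = re.compile(r'[a-zA-Z-]+')    # letters and hyphens only

def parse_fields(line, expected=8):
    """Return the list of fields, or None if the line has a different shape."""
    matched = LINE_RE.match(line.strip())
    if matched is None:
        return None
    # Split the body on commas and trim the surrounding spaces.
    fields = [part.strip() for part in matched.group(1).split(',')]
    if len(fields) != expected or not all(FIELD_RE.fullmatch(f) for f in fields):
        return None
    return fields

document = """Header
Description
21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;
Footer"""

for line in document.splitlines():
    fields = parse_fields(line)
    if fields is not None:
        print(fields)
```

Because the line-number prefix and the field count are both checked, comma-separated text in a header that doesn't follow the pattern should be skipped rather than returned.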

edited Sep 7, 2016 at 23:28

answered Sep 7, 2016 at 23:19

Steve Shipway


4 Comments

Matt Over a year ago

Thanks Steve. There will always be 8 fields that I have to read. But, I was trying to avoid the solution that repeats the same pattern multiple times. Thanks for your thorough explanation.

A. L Over a year ago

why not just global it?

Joran Beasley Over a year ago

this is not an ideal regex ... I think you can do much better

Steve Shipway Over a year ago

I agree a better regexp is possible; I'm more addressing the explanation of why (xx){7} doesn't set 7 items, and how to exclude the commas.
