I'm trying to write a regex in Python to extract the fields F1 to F8 from a line that looks like this:

LineNumber(digits): F1, F2, F3, ..., F8;

F1 to F8 can have lowercase/uppercase letters and hyphens.

For example:

Header
Description
21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;
Footer

What I've tried so far is matched = re.match(r'\d+: ([a-zA-Z-]*, ){7}(.*);', line), which matches lines in the above format. However, when I call matched.groups() to print the matched fields, I only get F7 and F8, while the expected output is a list containing F1 through F8.

I have a few questions regarding this regex:

  1. I guess the groups() method returns the fields that were grouped in the regex using (...). Why don't I get F1 to F6 in the output, even though they are grouped using (...) and have matched the regex?

  2. What is a better regex I can write to exclude the commas from F1 to F7? (A short explanation of the suggested regex would be much appreciated.)

  • Why don't you just parse it as a csv? Commented Sep 7, 2016 at 23:09
  • @PadraicCunningham There are other lines in the document that don't match this pattern. Furthermore, the line numbers are printed at the beginning of each line. Would parsing as a CSV still work? Commented Sep 7, 2016 at 23:11
  • Add a proper sample and I can tell you. Commented Sep 7, 2016 at 23:11
  • Check for lines that don't match, and remove the line number with line.split(': ')[1]. Commented Sep 7, 2016 at 23:11
  • @PadraicCunningham I wrote an example in problem description. Commented Sep 7, 2016 at 23:15

2 Answers

1
>>> import re
>>> pat = re.compile(r"""\s+ # one or more whitespace characters
                      (.*?) # the shortest possible run of anything (captured)
                      \s*   # zero or more whitespace characters
                      [;,]  # a semicolon or a comma
                     """, re.X)
>>> pat.findall("LineNumber(digits): F1, F2, F3, F4, F5, F6, F7, F8;")
['F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8']

3 Comments

Thanks Joran. This solution works well. Can you briefly explain how this works?
there you go :)
Thanks a lot. The only problem with this solution is that it matches some text in the header that is comma-separated but doesn't follow the pattern.
0

When you have a construct like (pattern){number}, it matches multiple instances but only the last one is stored. In other words, you get one bucket per pair of (), no matter how many times that pair is matched, and the last instance is the one kept. Note that you get a bucket for ALL bracket pairs, even ones that are not used, as in something like (a(b)?c)?d matching d (in Python, those unused buckets are None).
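
To see both behaviours directly, here is a small sketch using the regex and sample line from the question. The variable names repeated_groups and unused_groups are only for illustration.

```python
import re

line = "21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;"

# One capturing group repeated 7 times still yields a single bucket,
# and that bucket holds only the final repetition.
matched = re.match(r'\d+: ([a-zA-Z-]*, ){7}(.*);', line)
repeated_groups = matched.groups()
print(repeated_groups)   # ('YES, ', 'NO')

# Groups that never take part in the match still get a bucket, set to None.
unused_groups = re.match(r'(a(b)?c)?d', 'd').groups()
print(unused_groups)     # (None, None)
```

The first result also shows why the comma and space end up in the output: they sit inside the brackets, so they are part of what is captured.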

If you know how many items to expect, then you can do your regexp the long way:

\d+: *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *;

This way, since you have 8 sets of brackets, you get 8 items in the tuple returned by matched.groups(). Also, the spaces and commas between the fields sit outside the brackets, so they are not captured.
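
If typing the same sub-pattern eight times is the objection, one option is to build that same long pattern from its pieces. This is only a sketch: FIELD, SEPARATOR and pattern are names made up here, and the field count of 8 is taken from the question.

```python
import re

FIELD = r'([a-zA-Z-]+)'   # one captured field: letters and hyphens
SEPARATOR = r' *, *'      # a comma with optional spaces, not captured

# Expands to the same long-form regex: 8 separate bracket pairs, 8 buckets.
pattern = r'\d+: *' + SEPARATOR.join([FIELD] * 8) + r' *;'

line = "21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;"
matched = re.match(pattern, line)
fields = matched.groups()
print(fields)
# ('Yes', 'No', 'Yes', 'No', 'Ye-s', 'N-o', 'YES', 'NO')
```

The compiled regex is identical to the hand-written one, so the "last repetition wins" rule does not apply: each field has its own bracket pair.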

Given that your string is a CSV, you may be better off parsing it differently and splitting on commas rather than trying to have a single regexp to match the whole line.
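
A rough sketch of that approach, assuming the line format given in the question: check the overall shape with one non-repeating capture, then split the captured part on commas. parse_fields and LINE_RE are hypothetical names, and the sample text is the one from the question.

```python
import re

# Assumed shape: digits, a colon, exactly 8 comma-separated fields, a semicolon.
FIELD = r'[a-zA-Z-]+'
LINE_RE = re.compile(r'\d+: *(' + FIELD + r'(?: *, *' + FIELD + r'){7}) *;')

def parse_fields(line):
    """Return the 8 fields as a list, or None if the line has another shape."""
    m = LINE_RE.fullmatch(line.strip())
    if m is None:
        return None
    # The single group holds "F1, F2, ..., F8"; split it and trim the spaces.
    return [field.strip() for field in m.group(1).split(',')]

text = """Header
Description
21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;
Footer"""

for line in text.splitlines():
    fields = parse_fields(line)
    if fields is not None:
        print(fields)
# ['Yes', 'No', 'Yes', 'No', 'Ye-s', 'N-o', 'YES', 'NO']
```

Because the whole line has to match before anything is split, comma-separated text in a header that lacks the leading line number or the trailing semicolon is skipped rather than returned.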

4 Comments

Thanks Steve. There will always be 8 fields that I have to read. But, I was trying to avoid the solution that repeats the same pattern multiple times. Thanks for your thorough explanation.
why not just global it?
this is not an ideal regex ... I think you can do much better
I agree a better regexp is possible; I'm more addressing the explanation of why (xx){7} doesn't set 7 items, and how to exclude the commas.
