regular expression match issue in Python 2.7

Question

Using Python 2.7, I want to use a regular expression to find the Hello part of a given string. The rule is that Hello appears in a pattern that starts with {(1N), {(2N), and so on up to {(10N), or with a combination of them such as {(1N,2N,3N,4N), and ends with }.

Besides matching the Hello part, I also want to know which of 1N, 2N, ..., 10N matched (it may be a single one or several of them).

Any solutions are appreciated.

  Some content  {(1N,2N,3N,4N) Hello } Some content 

  Some content  {(1N) Python } Some content 

  Some content {(2N) Regex } Some content

In the first example, I want to know that 1N, 2N, 3N and 4N match, and that the matched string is Hello.

In the 2nd example, I want to know that 1N matches, and that the matched string is Python. In the 3rd example, I want to know that 2N matches, and that the matched string is Regex.

regards, Lin

@pistache, I tried matching {(1N (.*?) } through {(10N (.*?) }, i.e. running 10 separate matches against a given string. That sounds a bit stupid, and I also need to remove some unnecessary prefix matches, which is why I came here to ask. Do you have a more efficient solution? :)
– Lin Ma, Commented Aug 13, 2016 at 2:29
Have you tried playing around in any of the online Python regex testers? Since you have so many variables you might have to make several passes - first write an expression to match the \{\(.*?\) Hello \} part.
– wwii, Commented Aug 13, 2016 at 2:41
Thanks @wwii, I may have misstated the question. The Hello part could be any string; I just used Hello as an example. I will update the question. If you have any good ideas, that would be great.
– Lin Ma, Commented Aug 13, 2016 at 2:44

pistache · Accepted Answer · 2016-08-13 02:46:11Z

2

Regular expressions cannot really count (which is why you ended up writing the same pattern 10 times), and a repeated group only keeps its last match. Instead, you can match the whole sequence and then split it to get the individual items:

In [99]: import re

In [100]: match = re.compile(r"\{\s?\(\s?((\d+N,?)+)\)\s?(.*?)\s?\}").search("Some content  { (1N,2N,3N,4N) Hello } Some content")

In [101]: items, _, text = match.groups()

In [102]: splitted = items.split(',')

In [103]: print(splitted)
['1N', '2N', '3N', '4N']

In [104]: print(text)
Hello

NOTE: All the \s? are there to handle optional blanks; remove them wherever you know they are not needed.
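When a string contains several {(...) ... } blocks, the same idea can be applied once per block. The following is only a sketch, assuming a non-capturing item group and a lazy text group; the names BLOCK_RE and find_blocks are made up for illustration and are not part of the answer.

```python
import re

# Hypothetical helper names. The pattern is adapted from the answer:
# the item group is non-capturing and the text group is lazy, so that
# each {...} block is matched on its own.
BLOCK_RE = re.compile(r"\{\s?\(\s?((?:\d+N,?)+)\)\s?(.*?)\s?\}")


def find_blocks(s):
    """Return one (items, text) pair per {(...) text } block in s."""
    return [(m.group(1).split(","), m.group(2))
            for m in BLOCK_RE.finditer(s)]


sample = "Some content  {(1N,2N,3N,4N) Hello } Some content {(2N) Regex } end"
blocks = find_blocks(sample)
for items, text in blocks:
    print("%s -> %s" % (items, text))
```

The lazy (.*?) stops at the first closing brace, which is what keeps two blocks in the same string from being merged into one match.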

edited Aug 13, 2016 at 2:46

answered Aug 13, 2016 at 2:36

pistache


14 Comments

Lin Ma Over a year ago

Thanks pistache, I may have misstated the question. The Hello part could be any string; I just used Hello as an example. I will update the question. If you have any good ideas, that would be great.

pistache Over a year ago

_ is the variable name I assign to the second capture group. The second capture group temporarily holds each item such as "1N" or "10N", and when it is read after matching it only returns the last match for this subpattern. As I don't actually need this value (but need the group to be in the pattern), I save it to a name that I'm sure I'll never use. It would actually be better to make it a non-capturing group (change (\d+N,?) to (?:\d+N,?)) and then remove _, from the Python code, but I wasn't sure you knew about non-capturing groups.
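A small illustration of that point, using a shortened pattern rather than the full one from the answer: the repeated inner group reports only its last repetition, and making it non-capturing removes it from the result.

```python
import re

s = "(1N,2N,3N,4N)"

# Capturing inner group: group(2) holds only the last repetition.
m = re.search(r"\(((\d+N,?)+)\)", s)
whole, last = m.group(1), m.group(2)
print(whole)  # 1N,2N,3N,4N
print(last)   # 4N

# Non-capturing inner group: only the outer group is reported.
m2 = re.search(r"\(((?:\d+N,?)+)\)", s)
only = m2.groups()
print(only)   # ('1N,2N,3N,4N',)
```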

pistache Over a year ago

Note that it looks /a little bit/ like your task would be better done with a lexer or a tokenizer than with regular expressions. It doesn't change anything in the current example, but at some point you may run into unexpected shortcomings of regular expressions (due to irregular input) that make a proper parser necessary.
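For readers wondering what a tokenizer-based approach could look like, here is one possible sketch. It is not taken from the answer: the token names, TOKEN_RE and parse are all invented for illustration, and the grammar is an assumption based on the examples in the question.

```python
import re

# Hypothetical token set, assumed from the question's examples.
TOKEN_RE = re.compile(r"""
    (?P<OPEN>\{\s*\()      # start of a block, a brace followed by a paren
  | (?P<ITEM>\d+N)         # an item such as 1N or 10N
  | (?P<COMMA>,)           # item separator
  | (?P<RPAREN>\))         # end of the item list
  | (?P<CLOSE>\})          # end of the block
  | (?P<TEXT>[^{}(),]+)    # any other run of characters
""", re.VERBOSE)


def parse(s):
    """Return a list of (items, text) pairs using a small state machine."""
    blocks = []
    items = None      # None while we are outside a block
    text = []
    in_list = False   # True between the opening "(" and ")"
    for m in TOKEN_RE.finditer(s):
        kind, value = m.lastgroup, m.group()
        if kind == "OPEN":
            items, text, in_list = [], [], True
        elif kind == "ITEM" and in_list:
            items.append(value)
        elif kind == "RPAREN" and in_list:
            in_list = False
        elif kind == "CLOSE" and items is not None:
            blocks.append((items, "".join(text).strip()))
            items = None
        elif items is not None and not in_list:
            text.append(value)  # body text between ")" and "}"
    return blocks


parsed = parse("Some content {(1N,2N) Hello } tail {(5N) World }")
print(parsed)
```

Characters that match no token (for example a lone opening brace) are silently skipped by finditer here; a real parser would probably want to report them as errors.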

pistache Over a year ago

\s? means zero or one whitespace character (not greedy, as it is not a wildcard match anyway). I added these because the input format didn't seem consistent about whitespace (there's one before }, but not after {). The ,? is required to match zero or one , which will be there except after the last item, as it is used as a separator.

pistache Over a year ago

I just said "not greedy" to confirm the "in a non-greedy way" part in your post. "Greedy" does not apply in this context as it is not a wildcard match. \s? will indeed match zero or one whitespace character (space, tab, newline, etc.).
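The behaviour of \s? and ,? described in these comments can be checked on a few short strings. This is just a quick sketch; the test strings are made up for the purpose.

```python
import re

# \s? accepts zero or one whitespace character between "{" and "(".
open_re = re.compile(r"\{\s?\(")
candidates = ["{(", "{ (", "{\t(", "{  ("]
accepted = [bool(open_re.match(c)) for c in candidates]
print(accepted)  # the last one has two spaces, so it is rejected

# ,? lets the same subpattern match "1N," as well as the final "4N".
pieces = re.findall(r"\d+N,?", "1N,2N,3N,4N")
print(pieces)
```

If the input may contain more than one blank at those positions, \s* would be the usual replacement for \s?.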


jcxu · Accepted Answer · 2016-08-13 11:12:44Z

1

In [81]: import re
In [82]: string = "Some content  {(1N,2N,3N,4N) Hello } Some content"
In [83]: result = re.findall(r"(\((?:(?:10|[1-9])N(?:,|\)))+)\s*(\w+)", string)
In [84]: nums = re.findall(r"10N|[1-9]N", result[0][0])
In [85]: nums
Out[85]: ['1N', '2N', '3N', '4N']
In [86]: matchString = result[0][1]
In [87]: matchString
Out[87]: 'Hello'

For the new string:

In [1]: import re

In [2]: string = "{(1N,2N,3N,4N) Hello } Some Content {(5N) World }"

In [3]: re.findall(r"(\((?:(?:10|[1-9])N(?:,|\)))+)\s*(\w+)", string)
Out[3]: [('(1N,2N,3N,4N)', 'Hello'), ('(5N)', 'World')]

In [4]: result = re.findall(r"(\((?:(?:10|[1-9])N(?:,|\)))+)\s*(\w+)", string)

In [5]: nums = [re.findall(r"10N|[1-9]N", item[0]) for item in result]

In [6]: nums
Out[6]: [['1N', '2N', '3N', '4N'], ['5N']]

In [7]: matchString = [s[1] for s in result]

In [8]: matchString
Out[8]: ['Hello', 'World']
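To answer the "did 1N match?" part of the question, the two result lists can be combined into a lookup. This is a sketch only: tags_by_word is a hypothetical helper name, and it assumes each matched word occurs once in the string (a repeated word would overwrite the earlier entry).

```python
import re

# Pattern copied from the session above.
PATTERN = r"(\((?:(?:10|[1-9])N(?:,|\)))+)\s*(\w+)"


def tags_by_word(s):
    """Map each matched word to the list of tags (1N..10N) preceding it."""
    return dict((word, re.findall(r"10N|[1-9]N", tags))
                for tags, word in re.findall(PATTERN, s))


found = tags_by_word("{(1N,2N,3N,4N) Hello } Some Content {(5N) World }")
print(found)

# Membership tests then tell whether a given tag matched.
hello_has_1n = "1N" in found["Hello"]
world_has_1n_or_2n = any(t in found["World"] for t in ("1N", "2N"))
print(hello_has_1n)
print(world_has_1n_or_2n)
```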

edited Aug 13, 2016 at 11:12

answered Aug 13, 2016 at 2:53

jcxu


8 Comments

Lin Ma Over a year ago

Smart solution! Voted up! Thanks jcxu! Would you mind explaining a bit more why you use ?: 3 times?

Lin Ma Over a year ago

Hi jcxu, I tried your solution and ran into one new issue. For an input string like {(1N,2N,3N,4N) Hello } Some Content {(5N) World }, I want to get the matched information as Hello with 1N, 2N, 3N, 4N, and World with 5N. In other words, I want to match multiple times in a non-greedy way: after finding one match, keep going and look for the next one. Is there a solution?

Lin Ma Over a year ago

For my input string above, your solution outputs only Hello with 1N, 2N, 3N, 4N.

jcxu Over a year ago

Sorry for the late reply. I have updated the solution. As for your first question, the ?: means I don't need to capture that content, because the first pair of parentheses already contains everything I need, and the regex will be faster.

Lin Ma Over a year ago

Thanks jcxu. I tried your solution and it works pretty well. Just to confirm my understanding: you have 4 groups, where (?:10|[1-9]) and (?:,|\)) are non-capturing groups, and (\((?: (?:10|[1-9]) N (?:,|\))) and (\w+) are capturing groups. Is that correct? Thanks.
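One way to check the group count is to ask the compiled pattern directly. In the sketch below, the second pattern is a deliberately altered variant (every ?: removed) that is not from the answer; it is there only to show how capturing groups change what findall returns. Note that the original pattern has three (?: groups, since the one wrapping the repeated item is also non-capturing.

```python
import re

# Pattern from the answer: two capturing groups, three non-capturing ones.
pattern = re.compile(r"(\((?:(?:10|[1-9])N(?:,|\)))+)\s*(\w+)")
n_groups = pattern.groups
pairs = pattern.findall("{(10N) Hi }")
print(n_groups)
print(pairs)

# Altered variant for comparison only: every group captures.
capturing = re.compile(r"(\(((10|[1-9])N(,|\)))+)\s*(\w+)")
n_capturing = capturing.groups
wide = capturing.findall("{(10N) Hi }")
print(n_capturing)
print(wide)
```

With the non-capturing version, findall yields tidy (tags, word) pairs; with the altered version each tuple also carries the inner groups, which then have to be ignored by the caller.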
