
Using Python 2.7, I want to use a regular expression to find the Hello part of a given string. The rule is that Hello appears in a pattern that starts with {(1N), {(2N) (and so on up to 10N), or a combination of them such as {(1N,2N,3N,4N), and ends with }.

Besides matching the Hello part, I also want to know whether 1N matched, or 2N, or 10N, or a combination such as 1N and 2N.

Any solutions are appreciated.

  Some content  {(1N,2N,3N,4N) Hello } Some content 

  Some content  {(1N) Python } Some content 

  Some content {(2N) Regex } Some content 

In the first example, I want to know that 1N, 2N, 3N and 4N match, and that the matched string is Hello.

In the 2nd example, I want to know that 1N matches and the matched string is Python. In the 3rd example, I want to know that 2N matches and the matched string is Regex.

regards, Lin

  • Can you show your own attempt? Commented Aug 13, 2016 at 2:24
  • @pistache, I tried matching {(1N (.*?) } through {(10N (.*?) }, i.e. matching 10 times against a given string, which feels clumsy, and I also need to remove some unnecessary prefix matches. That is why I am asking here. Do you have a more efficient solution? :) Commented Aug 13, 2016 at 2:29
  • Have you tried playing around in any of the online Python regex testers? Since you have so many variables you might have to make several passes: first write an expression to match the \{\(.*?\) Hello \} part. Commented Aug 13, 2016 at 2:41
  • Thanks @wwii, I may have explained the question poorly. The Hello part could be any string; I just used Hello as an example, and I will update the question. If you have any good ideas, that would be great. Commented Aug 13, 2016 at 2:44

2 Answers

Regular expressions cannot really count (which is why you ended up writing the same pattern 10 times). Instead, you can match the whole sequence and then split it to get the individual items:

In [99]: import re

In [100]: match = re.compile(r"\{\s?\(\s?((\d+N,?)+)\)\s?(.*)\s?\}").search("Some content  { (1N,2N,3N,4N) Hello } Some content")

In [101]: items, _, text = match.groups()

In [102]: splitted = items.split(',')

In [103]: print(splitted)
['1N', '2N', '3N', '4N']

In [104]: print(text)
Hello 

NOTE: All the \s? are there to handle optional blanks; remove them wherever you know you don't need them.
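As a rough sketch of how the match-then-split idea could be wrapped up for strings that contain more than one block: the version below swaps `search` for `finditer`, makes the text part lazy (`.*?`) so it stops at the first `}`, and uses a non-capturing group for the repeated item. The names `TAGGED` and `find_tagged` are made up for illustration, and the sketch assumes the items never carry a trailing comma.

```python
import re

# Hypothetical names; the pattern is adapted from the one in the answer above.
# (?:...) keeps the repeated item out of the groups, and .*? is lazy so each
# match ends at the nearest closing brace.
TAGGED = re.compile(r"\{\s?\(\s?((?:\d+N,?)+)\)\s?(.*?)\s?\}")

def find_tagged(s):
    """Return a list of (items, text) pairs, one per {(...) text } block."""
    return [(m.group(1).split(","), m.group(2).strip())
            for m in TAGGED.finditer(s)]

pairs = find_tagged("Some content  {(1N,2N,3N,4N) Hello } Some content {(2N) Regex }")
print(pairs)
# [(['1N', '2N', '3N', '4N'], 'Hello'), (['2N'], 'Regex')]
```

Checking whether a particular item matched is then an ordinary membership test, for example `"1N" in pairs[0][0]`.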


14 Comments

Thanks pistache, I may have explained the question poorly. The Hello part could be any string; I just used Hello as an example, and I will update the question. If you have any good ideas, that would be great.
_ is the variable name I assign to the second capture group. The second capture group temporarily holds each item, such as "1N" or "10N", and when it is read after matching it only returns the last match for this subpattern. As I don't actually need this value (but need the group to be in the pattern), I save it to a name that I'm sure I'll never use. It would actually be better to make it a non-capturing group (change (\d+N,?) to (?:\d+N,?), and then remove _, from the Python code), but I wasn't sure you knew about non-capturing groups.
Note that it looks /a little bit/ like your task would be better handled by a lexer or a tokenizer than by regular expressions. That doesn't change anything in the current example, but at some point you may run into unexpected shortcomings of regular expressions (due to irregular input) that make a proper parser necessary.
\s? means zero or one whitespace character (not greedy, as it is not a wildcard match anyway). I added these because the input format didn't seem consistent about whitespace (there's one before }, but not after {). The ,? is required to match zero or one , which follows every item except the last, as it is used as a separator.
I just said "not greedy" to confirm the "in a non-greedy way" part of your post. "Greedy" does not really apply in this context, as it is not a wildcard match. \s? will indeed match zero or one whitespace character (space, tab, newline, etc.).
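To make the point about the second capture group concrete, here is a small sketch (using a shortened version of the pattern, without the \s? parts and the text group) showing that a repeated capturing group only keeps its last repetition, and that the non-capturing form drops it from the result entirely:

```python
import re

s = "{(1N,2N,3N,4N) Hello }"

# Repeated capturing group: group 2 is overwritten on every repetition,
# so only the last item survives.
m = re.search(r"\(((\d+N,?)+)\)", s)
print(m.groups())   # ('1N,2N,3N,4N', '4N')

# Non-capturing variant: the repeated item no longer appears in groups(),
# so there is nothing to throw away with _.
m2 = re.search(r"\(((?:\d+N,?)+)\)", s)
print(m2.groups())  # ('1N,2N,3N,4N',)
```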
In [81]: import re
In [82]: string = "Some content  {(1N,2N,3N,4N) Hello } Some content"
In [83]: result = re.findall(r"(\((?:(?:10|[1-9])N(?:,|\)))+)\s*(\w+)", string)
In [84]: nums = re.findall(r"10N|[1-9]N", result[0][0])
In [85]: nums
Out[85]: ['1N', '2N', '3N', '4N']
In [86]: matchString = result[0][1]
In [87]: matchString
Out[87]: 'Hello'

For the new string:

In [1]: import re

In [2]: string = "{(1N,2N,3N,4N) Hello } Some Content {(5N) World }"

In [3]: re.findall(r"(\((?:(?:10|[1-9])N(?:,|\)))+)\s*(\w+)", string)
Out[3]: [('(1N,2N,3N,4N)', 'Hello'), ('(5N)', 'World')]

In [4]: result = re.findall(r"(\((?:(?:10|[1-9])N(?:,|\)))+)\s*(\w+)", string)

In [5]: nums = [re.findall(r"10N|[1-9]N", item[0]) for item in result]

In [6]: nums
Out[6]: [['1N', '2N', '3N', '4N'], ['5N']]

In [7]: matchString = [s[1] for s in result]

In [8]: matchString
Out[8]: ['Hello', 'World']
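If the goal is to ask "did 1N match for this word?", one possible way to combine the two lists above is a dictionary from each matched word to the set of its items. This is only a sketch: the names `block`, `item` and `tags_by_word` are made up here, and keying by word assumes the same word does not appear in two different blocks (a later block would overwrite an earlier one).

```python
import re

string = "{(1N,2N,3N,4N) Hello } Some Content {(5N) World }"

# Same two patterns as in the session above, compiled once.
block = re.compile(r"(\((?:(?:10|[1-9])N(?:,|\)))+)\s*(\w+)")
item = re.compile(r"10N|[1-9]N")

# Map each matched word to the set of items found in front of it.
tags_by_word = dict((word, set(item.findall(tags)))
                    for tags, word in block.findall(string))

print("1N" in tags_by_word["Hello"])   # True
print("5N" in tags_by_word["Hello"])   # False
print(sorted(tags_by_word["World"]))   # ['5N']
```

Note that (\w+) only picks up a single word after the parentheses, so multi-word text would need a different final group.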

8 Comments

Smart solution! Upvoted! Thanks jcxu! Would you mind explaining a bit more about why you use ?: three times?
Hi jcxu, I tried your solution and ran into a new issue. For an input string like {(1N,2N,3N,4N) Hello } Some Content {(5N) World }, I want to get Hello with 1N, 2N, 3N, 4N, and World with 5N. In other words, I want to match multiple times in a non-greedy way: after finding one match, keep going and look for the next one. Is there a solution?
Your solution's output for my input string above is only Hello with 1N, 2N, 3N, 4N.
Sorry for the late reply. I have updated the solution. As for your first question, ?: means I don't need to capture that content, because the first pair of parentheses already contains everything I need, and the regex will be faster.
Thanks jcxu. I tried your solution and it works pretty well. Just to confirm my understanding: you have 4 groups, where (?:10|[1-9]) and (?:,|\)) are non-capturing groups, and (\((?:(?:10|[1-9])N(?:,|\)))+) and (\w+) are capturing groups. Is that correct? Thanks.
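One way to check the group count yourself is to compile the pattern and look at its `groups` attribute, which counts capturing groups only. Counting the parentheses by hand, the pattern has three (?:...) groups (the one wrapping each repeated item, plus the two inside it) and two capturing groups. A quick sketch:

```python
import re

pattern = re.compile(r"(\((?:(?:10|[1-9])N(?:,|\)))+)\s*(\w+)")

# Number of capturing groups; the (?:...) groups are not counted.
print(pattern.groups)   # 2

m = pattern.search("{(5N) World }")
print(m.groups())       # ('(5N)', 'World')
```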