7

According to the pattern match here, the matches are 213.239.250.131 and 014.10.26.06.

Yet when I run the generated Python code and print out the value of re.findall(p, test_str), I get:

[('', '', '213.239.250.131'), ('', '', '014.10.26.06')]

I could hack around the list and it tuples to get the values I'm looking for (the IP addresses), but (i) they might not always be in the same position in the tuples and (ii) I'd rather understand what's going on here so I can either tighten up the regex, or extract only IP addresses using Python's own re functionality.

Why do I get this list of tuples, why the apparent whitespace matches, and how do we ensure that only the IP addresses are returned?

2
  • 8
    Can't you put the match pattern here? Just for us, Please... Commented Jun 11, 2015 at 21:41
  • 3
    a) Post the damn regex, here, already. You can't incorporate-by-link to third-party site, which might go away in the future. b) The 'version 3' of your regex squelches the empty capture groups. So if you've already solved the problem using the answers here, please close the question. Commented Jun 11, 2015 at 22:19

2 Answers 2

13

Whenever you are using a capturing group, it always returns a submatch, even if it is empty/null. You have 3 capturing groups, so you will always have them in the findall result.

In regex101.com, you can see these non-participating groups by turning them on in Options:

enter image description here

You may tighten up your regex by removing capturing groups:

(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

Or even (?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}(?:\.\d{1,3}){3}.

See a regex demo

And since the regex pattern does not contain capturing groups, re.findall will only return matches, not capturing group contents:

import re
p = re.compile(r'(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
test_str = "from mail.example.com (example.com. [213.239.250.131]) by\n mx.google.com with ESMTPS id xc4si15480310lbb.82.2014.10.26.06.16.58 for\n <[email protected]> (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256\n bits=128/128); Sun, 26 Oct 2014 06:16:58 -0700 (PDT)"
print(re.findall(p, test_str))

Output of the online Python demo:

['213.239.250.131', '014.10.26.06']
Sign up to request clarification or add additional context in comments.

2 Comments

Fantastic response, and thanks for making me aware of IDEONE.
Improved the answer to remove unnecessary code and complications.
1

these are the capturing groups. if you do or queries it will return empty matches for the non matching expressions.

(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})

the first or has 2 groups:
(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})

and after the or there is the third:
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})

to say it in a simple way each round bracket defines a capturing group which will show up if the value matches or not.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.