I am very new to regular expressions. I am using Python 3.3 to filter addresses with the re module.

I am wondering why the following regex:

m3 = re.search("[ ,]*[0-9]{1,3}\s{0,1}(/|-|bt.)\s{0,1}[0-9]{1,3} ",Row[3]);

matches strings like:

rue de l’hotel des monnaies 49-51 1060Bxl
av Charles Woeste309 bte2 -Bxl
Rue d'Anethan 46 bte 6
Avenue Defré 269/6

but does not match strings like these (m3 is None):

Avenue Guillaume de Greef,418 bte 343
Joseph Cuylits,24 bte5 Rue Louis
Ernotte 64 bte 3
Rue Saint-Martin 51 bte 7

This looks really strange to me. Any explanation is welcome. Thank you.

  • What is the regular pattern the addresses follow? Commented Oct 29, 2014 at 21:50
  • Welcome to Python. No need to put semicolons after your statements, unless you want to be identified as a convert from Java/C/Javascript ;-) Commented Oct 29, 2014 at 22:17

1 Answer

Seems like the trailing space " " at the end of your regex was unintentional and is breaking things: "[ ,]*[0-9]{1,3}\s{0,1}(/|-|bt.)\s{0,1}[0-9]{1,3} "

Here is what the regex passed to re.search means. I recommend the re.VERBOSE (re.X) flag, which lets you put comments inside a regex so it doesn't quickly become write-only ;-). Note that under re.VERBOSE, whitespace in the pattern is ignored unless it is escaped or inside a character class, so we couldn't even write that " " character literally (you'd have to use [ ] or else \s):

import re

# Raw string so that \s reaches the regex engine without Python escape warnings
addr_pat = re.compile(r"""
    [ ,]*       # zero or more leading spaces or commas
    [0-9]{1,3}  # 1-3 consecutive digits
    \s{0,1}     # one optional whitespace (instead you could just write \s?)
    (/|-|bt.)   # either forward-slash, minus or "bt[any character]" e.g. "bte"
    \s{0,1}     # one optional whitespace
    [0-9]{1,3}  # 1-3 consecutive digits
                # we omitted the trailing " " whitespace you inadvertently had
""", re.VERBOSE)

m3 = addr_pat.search("Rue Saint-Martin 51 bte 7 ")
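
As a small illustration of the whitespace note above (a sketch only; the toy patterns and variable names are made up for the demonstration and are not part of the original regex): under re.VERBOSE a bare space in the pattern is dropped, while [ ], \s or a backslash-escaped space still match one.

```python
import re

# Under re.VERBOSE, unescaped whitespace outside a character class is ignored,
# so the space typed between "a" and "b" is not part of the pattern.
bare = re.compile(r"a b", re.VERBOSE)
print(bare.search("ab") is not None)    # pattern behaves like "ab"
print(bare.search("a b") is not None)   # the literal space in the text gets in the way

# Three ways to keep a space significant in verbose mode.
in_class = re.compile(r"a[ ]b", re.VERBOSE)     # space inside a character class
escaped = re.compile(r"a\ b", re.VERBOSE)       # backslash-escaped space
whitespace = re.compile(r"a\sb", re.VERBOSE)    # any whitespace character

for pat in (in_class, escaped, whitespace):
    print(pat.search("a b") is not None)
```

So if the trailing space had been intentional, it would need to be written as [ ] or \s in the verbose version.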

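The requirement for a trailing space is why strings that end right after the final number fail to match, which covers the first, third and fourth lines below. The second line, if the character after "bte5" is an ordinary space, does satisfy the original pattern as written here, so its failure in your data possibly has another cause, such as a different whitespace character: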

Avenue Guillaume de Greef,418 bte 343
Joseph Cuylits,24 bte5 Rue Louis
Ernotte 64 bte 3
Rue Saint-Martin 51 bte 7
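
A quick way to check this yourself is sketched below. It assumes the strings contain only ordinary ASCII spaces and no trailing whitespace, which may not be true of the real data; the names with_space, without_space and results are made up for the example.

```python
import re

# Pattern from the question, including the trailing literal space.
with_space = re.compile(r"[ ,]*[0-9]{1,3}\s{0,1}(/|-|bt.)\s{0,1}[0-9]{1,3} ")
# Same pattern with the trailing space removed.
without_space = re.compile(r"[ ,]*[0-9]{1,3}\s{0,1}(/|-|bt.)\s{0,1}[0-9]{1,3}")

# Assumption: plain ASCII spaces, nothing after the last visible character.
addresses = [
    "Avenue Guillaume de Greef,418 bte 343",
    "Joseph Cuylits,24 bte5 Rue Louis",
    "Ernotte 64 bte 3",
    "Rue Saint-Martin 51 bte 7",
]

# Map each address to (matches with trailing space, matches without it).
results = {
    addr: (with_space.search(addr) is not None,
           without_space.search(addr) is not None)
    for addr in addresses
}

for addr, (old, new) in results.items():
    print(repr(addr), "with trailing space:", old, "without:", new)
```

Under those assumptions, all four strings match once the trailing space is removed, and only "Joseph Cuylits,24 bte5 Rue Louis" matches the original pattern, because it is the only one with a space right after the final number.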

6 Comments

Yes, this is what I intended to do. The first and the last two strings will not match because of the lack of a trailing space; thanks for pointing out this mistake. But why did the second not match? It looks like it should.
The inadvertent requirement for trailing space is breaking all four of these. Just remove it from the regex!
But there is a trailing space in 'Joseph Cuylits,24 bte5*Rue Louis'
Do you mean " " equals [ ]$ ?
"bte5 Rue Louis " matches \s{0,1} (0/1 internal whitespaces) but then does not have any subsequent 1-3 digit number [0-9]{1,3} followed by trailing space ` `.
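
On the question of whether a trailing " " behaves like [ ]$, a small check suggests it does not: a literal space can match a space anywhere in the string, while [ ]$ only matches a space at the very end. The simplified patterns and sample strings below are illustrative only, not taken from the original data.

```python
import re

# Digits followed by a space anywhere in the string.
literal_space = re.compile(r"[0-9]{1,3} ")
# Digits followed by a space that must be the last character.
space_at_end = re.compile(r"[0-9]{1,3}[ ]$")

mid_string = "bte5 Rue"   # space after the digit, then more text
end_string = "bte 7 "     # space after the digit is the final character

for text in (mid_string, end_string):
    print(repr(text),
          "literal space:", literal_space.search(text) is not None,
          "[ ]$:", space_at_end.search(text) is not None)
```

The literal-space pattern finds a match in both strings, whereas the [ ]$ version only matches the string that ends with the space.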