I'm trying to parse two types of one-line address strings:

Flat XXX, XXX <Building name>, <City/town>, <State> <Postcode>

DDD <Generic place name>, <Road name> road, <City/town>, <State>

using the following regex:

re.search(r'(Flat \w+)?\W*(.+)\W*([a-zA-Z]{1,2}\d+\s+\d+[a-zA-Z]{1,2})?', address, re.I)

Here XXX is some alphanumeric string, and DDD is a number. I expect group 1 to be Flat XXX if the address is of the first type, or None if not. Group 2 should be XXX <Building name>, <City/town>, <State> if the address is of the first type, or <Road name> road, <City/town>, <State> if it is of the second type. Group 3 should be the postcode if the address is of the first type, or None if not. The postcode is a UK postcode, for which my regex (not comprehensively accurate, but mostly correct for my purpose) is [a-zA-Z]{1,2}\d+\s+\d+[a-zA-Z]{1,2}. Case is to be ignored. There may be no comma between Flat XXX (if it exists) and <Building name>, and there may be a comma between the city and the postcode (if it exists).

>>> import re
>>> address1 = 'Flat 29, Victoria House, Redwood Lane, Richmond, London SW14 9XY'
>>> re.search(r'(Flat \w+)?\W*(.+)\W*([a-zA-Z]{1,2}\d+\s+\d+[a-zA-Z]{1,2})?', address1, re.I).groups()
('Flat 29', 'Victoria House, Redwood Lane, Richmond, London SW14 9XY', None)
>>> address2 = '91 Fleet, Major Road, Fleet, Hampshire'
>>> re.search(r'(Flat \w+)?\W*(.+)\W*([a-zA-Z]{1,2}\d+\s+\d+[a-zA-Z]{1,2})?', address2, re.I).groups()
(None, '91 Fleet, Major Road, Fleet, Hampshire', None)

I am not sure what is going wrong, but I think the middle group ..\W*(.+)\W*.. is more or less capturing everything.
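
A minimal sketch that reproduces the behaviour side by side. The postcode sub-pattern is pulled into a variable (`postcode`, a name introduced here only for readability), and the second call swaps the middle group for its lazy form to show that laziness alone does not help while the trailing group stays optional and the pattern is unanchored:

```python
import re

address1 = 'Flat 29, Victoria House, Redwood Lane, Richmond, London SW14 9XY'
postcode = r'[a-zA-Z]{1,2}\d+\s+\d+[a-zA-Z]{1,2}'  # helper name, not in the original

# Greedy middle group: .+ runs to the end of the string, and because the
# postcode group is optional the engine never has to backtrack to fill it.
greedy = re.search(r'(Flat \w+)?\W*(.+)\W*(' + postcode + r')?', address1, re.I)
print(greedy.groups())
# ('Flat 29', 'Victoria House, Redwood Lane, Richmond, London SW14 9XY', None)

# Lazy middle group: .+? stops after a single character, since nothing
# after it is required to match.
lazy = re.search(r'(Flat \w+)?\W*(.+?)\W*(' + postcode + r')?', address1, re.I)
print(lazy.groups())
# ('Flat 29', 'V', None)
```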

  • Yes, it is: regex101.com/r/uC6fiZ/1 Commented Dec 15, 2016 at 9:47
  • What do you ultimately need to get from the addresses? Commented Dec 15, 2016 at 9:52
  • Have you considered and tried the non-greedy version: ..\W*(.+?)\W*..? Commented Dec 15, 2016 at 9:52
  • I described what I need: I expect group 1 to be Flat XXX if the address is of the first type or None if not, group 2 to be XXX <Building name>, <City/town>, <State> if the address is of the first type, or <Road name> road, <City/town>, <State> if it is of the second type, and group 3 to be the postcode if the address is of the first type or None if not. Commented Dec 15, 2016 at 9:54
  • The non-greedy version gives me ('Flat 29', 'V', None) for the first type of address, e.g. for address1 = Flat 29, Victoria House, Redwood Lane, Richmond, London SW14 9XY. Commented Dec 15, 2016 at 9:56

1 Answer

It's not particularly elegant, but here's a bit of a workaround (assuming that <State> doesn't contain any digits):

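import re

addresses = ['Flat 29, Victoria House, Redwood Lane, Richmond, London SW14 9XY',
             '91 Fleet, Major Road, Fleet, Hampshire']

# The last group matches either a postcode (with an optional space in the
# middle) or, failing that, a trailing run of non-digit characters.
regexp = re.compile(r'(Flat \w+)?[,\s]*(.*)\s([a-zA-Z]{1,2}\d+\s?\d+[a-zA-Z]{1,2}|\D*)$', re.I)

for address in addresses:
    sep_addr = list(regexp.search(address).groups())
    # No digits in the last group means it is not a postcode:
    # fold it back into the address part.
    if not any(x.isdigit() for x in sep_addr[2]):
        sep_addr[1] += ' ' + sep_addr[2]
        sep_addr[2] = None
    print(sep_addr)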

The last group (index 2 in sep_addr) matches either the postcode or, failing that, the last word of the address. Checking whether that group contains any digits tells us if it is a postcode. If it isn't, we append it to the address part (index 1) to restore the full address, and set the last group to None. This returns:

['Flat 29', 'Victoria House, Redwood Lane, Richmond, London', 'SW14 9XY']
[None, '91 Fleet, Major Road, Fleet, Hampshire', None]
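
To see why the merging step is needed, here is a small sketch that prints the raw groups before any post-processing. It uses the same pattern, with `\s?` for the optional postcode space; the variable names `raw_with_postcode` and `raw_without_postcode` are introduced here for illustration only:

```python
import re

# Same pattern as in the answer, with \s? for the optional postcode space.
regexp = re.compile(
    r'(Flat \w+)?[,\s]*(.*)\s([a-zA-Z]{1,2}\d+\s?\d+[a-zA-Z]{1,2}|\D*)$', re.I)

raw_with_postcode = regexp.search(
    'Flat 29, Victoria House, Redwood Lane, Richmond, London SW14 9XY').groups()
raw_without_postcode = regexp.search(
    '91 Fleet, Major Road, Fleet, Hampshire').groups()

print(raw_with_postcode)
# ('Flat 29', 'Victoria House, Redwood Lane, Richmond, London', 'SW14 9XY')

# Without a postcode, the \D* alternative grabs the last word, which is
# why it has to be glued back onto the address part afterwards.
print(raw_without_postcode)
# (None, '91 Fleet, Major Road, Fleet,', 'Hampshire')
```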

EDIT: made the space in the middle of the postcode optional, so that postcodes written without a space are still matched.
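
If the post-processing step is unwanted, one possible alternative (a sketch only, and a different technique from the workaround above) is to anchor the pattern at both ends and make the middle group lazy. With `$` at the end, the lazy group has to keep expanding until either the optional postcode or the end of the string follows. The names `ADDRESS_RE` and `split_address` are hypothetical, and the postcode pattern is still the approximate one from the question, with `\s*` so the inner space is optional:

```python
import re

# Hypothetical single-pattern alternative: anchored at both ends, lazy
# middle group, optional postcode group with an optional inner space.
ADDRESS_RE = re.compile(
    r'^(Flat \w+)?\W*(.+?)\W*([a-zA-Z]{1,2}\d+\s*\d+[a-zA-Z]{1,2})?$', re.I)


def split_address(address):
    """Return (flat, body, postcode); flat and postcode may be None."""
    match = ADDRESS_RE.match(address)
    return match.groups() if match else None


for address in ['Flat 29, Victoria House, Redwood Lane, Richmond, London SW14 9XY',
                '91 Fleet, Major Road, Fleet, Hampshire',
                'Flat 29 Victoria House, Richmond, London, SW149XY']:
    print(split_address(address))
# ('Flat 29', 'Victoria House, Redwood Lane, Richmond, London', 'SW14 9XY')
# (None, '91 Fleet, Major Road, Fleet, Hampshire', None)
# ('Flat 29', 'Victoria House, Richmond, London', 'SW149XY')
```

This has only been traced against the three sample strings shown, so real address data may still need the digit check or a stricter postcode pattern.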
