Python regex - address parsing

Question

I'm trying to parse two types of one-line address strings:

Flat XXX, XXX <Building name>, <City/town>, <State> <Postcode>

DDD <Generic place name>, <Road name> road, <City/town>, <State>

using the following regex:

r'(Flat \w+)?\W*(.+)\W*([a-zA-Z]{1,2}\d+\s+\d+[a-zA-Z]{1,2})?'

Here XXX is some alphanumeric string, and DDD is a number. I expect group 1 to be Flat XXX if the address is of the first type or None if not, group 2 to be XXX <Building name>, <City/town>, <State> if the address is of the first type, or <Road name> road, <City/town>, <State> if it is of the second type, and group 3 to be the postcode if the address is of the first type or None if not. The postcode is a UK postcode, for which my regex (not comprehensively accurate, but mostly correct for my purpose) is [a-zA-Z]{1,2}\d+\s+\d+[a-zA-Z]{1,2}. Case is to be ignored. There may be no comma between Flat XXX (if it exists) and <Building name>, and there may be a comma between the city and the postcode (if it exists).

>>> import re
>>> address1 = 'Flat 29, Victoria House, Redwood Lane, Richmond, London SW14 9XY'
>>> re.search(r'(Flat \w+)?\W*(.+)\W*([a-zA-Z]{1,2}\d+\s+\d+[a-zA-Z]{1,2})?', address1, re.I).groups()
('Flat 29', 'Victoria House, Redwood Lane, Richmond, London SW14 9XY', None)
>>> address2 = '91 Fleet, Major Road, Fleet, Hampshire'
>>> re.search(r'(Flat \w+)?\W*(.+)\W*([a-zA-Z]{1,2}\d+\s+\d+[a-zA-Z]{1,2})?', address2, re.I).groups()
(None, '91 Fleet, Major Road, Fleet, Hampshire', None)

I am not sure what is going wrong, but I think the middle group ..\W*(.+)\W*.. is more or less capturing everything.

Have you considered and tried the non-greedy version: ..\W*(.+?)\W*..?
– user707650, Commented Dec 15, 2016 at 9:52
I described what I need: I expect group 1 to be Flat XXX if the address is of the first type or None if not, group 2 to be XXX <Building name>, <City/town>, <State> if the address is of the first type, or <Road name> road, <City/town>, <State> if it is of the second type, and group 3 to be the postcode if the address is of the first type or None if not.
– srm, Commented Dec 15, 2016 at 9:54
The non-greedy version gives me ('Flat 29', 'V', None) for the first type of address, e.g. for address1 = 'Flat 29, Victoria House, Redwood Lane, Richmond, London SW14 9XY'.
– srm, Commented Dec 15, 2016 at 9:56
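
The two results reported in the question and the comments can be reproduced with the short script below, which may help show why neither quantifier works on its own. It reuses the question's pattern unchanged; the names POSTCODE, greedy and lazy are introduced here purely for illustration.

```python
import re

# Postcode sub-pattern copied from the question (approximate, not a full UK validator).
POSTCODE = r'[a-zA-Z]{1,2}\d+\s+\d+[a-zA-Z]{1,2}'

address1 = 'Flat 29, Victoria House, Redwood Lane, Richmond, London SW14 9XY'

# Greedy middle group: (.+) runs to the end of the string, and the optional
# postcode group is then satisfied by matching nothing.
greedy = re.search(r'(Flat \w+)?\W*(.+)\W*(' + POSTCODE + r')?', address1, re.I)

# Lazy middle group: (.+?) stops after a single character, because everything
# after it is optional and can also match nothing.
lazy = re.search(r'(Flat \w+)?\W*(.+?)\W*(' + POSTCODE + r')?', address1, re.I)

print(greedy.groups())
print(lazy.groups())
```

In both cases nothing forces the optional postcode group to take part in the match, since the pattern is not anchored to the end of the string.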

asongtoruin · Accepted Answer · 2016-12-15 13:09:03Z

It's not particularly elegant, but here's a bit of a workaround (assuming that <State> doesn't contain any digits):

import re

addresses = ['Flat 29, Victoria House, Redwood Lane, Richmond, London SW14 9XY',
             '91 Fleet, Major Road, Fleet, Hampshire']

# Last group: either a postcode or a trailing run of non-digit characters.
regexp = re.compile(r'(Flat \w+)?[,\s]*(.*)\s([a-zA-Z]{1,2}\d+\s?\d+[a-zA-Z]{1,2}|\D*)$', re.I)

for address in addresses:
    sep_addr = list(regexp.search(address).groups())
    # No digits in the last group means it is not a postcode:
    # move it back into the address part.
    if not any(x.isdigit() for x in sep_addr[2]):
        sep_addr[1] += ' ' + sep_addr[2]
        sep_addr[2] = None
    print(sep_addr)

The last capture group (sep_addr[2]) matches either the postcode or the last word of the address. Checking that group for digits tells us whether it is a postcode. If it isn't, we append it to the address part (sep_addr[1]) to restore the full address, and set sep_addr[2] to None. This prints:

['Flat 29', 'Victoria House, Redwood Lane, Richmond, London', 'SW14 9XY']
[None, '91 Fleet, Major Road, Fleet, Hampshire', None]

EDIT: made the space in the middle of the postcode optional, so that postcodes written without a space are still matched.
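
As a possible alternative to post-processing (a sketch that is not part of the answer above, and that has only been checked against the sample addresses), the lazy group suggested in the comments appears to work once the pattern is anchored at both ends. With `^` and `$` in place, `(.+?)` has to keep growing until the rest of the string is either a postcode or empty. `ADDRESS_RE` and `split_address` are hypothetical names chosen for this example.

```python
import re

# Single-pattern variant: anchored at both ends, lazy middle group.
ADDRESS_RE = re.compile(
    r'^(Flat \w+)?\W*'                           # optional "Flat XXX" prefix
    r'(.+?)\W*'                                  # address body, as short as possible
    r'([a-zA-Z]{1,2}\d+\s?\d+[a-zA-Z]{1,2})?$',  # optional postcode, then end of string
    re.I,
)

def split_address(address):
    """Return (flat, body, postcode), or None if the string does not match."""
    match = ADDRESS_RE.search(address)
    return match.groups() if match else None

for address in ['Flat 29, Victoria House, Redwood Lane, Richmond, London SW14 9XY',
                '91 Fleet, Major Road, Fleet, Hampshire']:
    print(split_address(address))
```

The postcode sub-pattern is the same approximate one used in the question, so this variant inherits its limitations.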
