Maybe some regex-Master can solve my problem.

I have a big list of addresses with no separators (, ;). Each address string contains the following information:

  • The first group is the street name
  • The second group is the street number
  • The third group is the zipcode (optional)
  • The last group is the town name (optional)

regex_png

As you can see in the image above, the last two test strings do not match. I need the last two regex groups to be optional, and the third group should be either 4 or 5 digits.

I tried (\d{4,5}) to allow 4 or 5 digits, but this only half works, as you can see here: https://regex101.com/r/ZurqHh/1
regex_4_5_digits (This sometimes mixes the street number and zipcode together)

I also tried (?:\d{5})? to make the third and fourth groups optional, but this destroys my whole group layout: https://regex101.com/r/EgxeMy/1

regex_optional

This is my current regex:

/^([a-zäöüÄÖÜß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+\/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$/im

Try it out yourself: https://regex101.com/r/zC8NCP/1

My brain is only farting at this moment and I can't think straight anymore.

Please help me fix this problem so I can die in peace.

  • So is the street number optional as well? I noticed that for the first address it doesn't seem like there is one. Commented Feb 15, 2022 at 12:54
  • @OddOneOut yes the street number is also optional Commented Feb 15, 2022 at 12:57

2 Answers

You can use

^(.*?)(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?(?:\s+(\d{4,5})(?:\s+(.*))?)?$

See the regex demo (note all \s are replaced with \h to only match horizontal whitespaces).

Details:

  • ^ - start of string
  • (.*?) - Group 1: any zero or more chars other than line break chars
  • (?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))? - an optional non-capturing group matching
    • \s+ - one or more whitespaces
    • (\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b) - Group 2:
      • \d+ - one or more digits
      • (?:\s*[-|+\/]\s*\d+)* - zero or more sequences of zero or more whitespaces, -, +, | or /, zero or more whitespaces, one or more digits
      • \s* - zero or more whitespaces
      • [a-z]?\b - an optional lowercase ASCII letter and a word boundary
  • (?:\s+(\d{4,5})(?:\s+(.*))?)? - an optional non-capturing group matching
    • \s+ - one or more whitespaces
    • (\d{4,5}) - Group 3: four or five digits
    • (?:\s+(.*))? - an optional sequence of one or more whitespaces and then any zero or more chars other than line break chars as many as possible
  • $ - end of string.
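
As a rough illustration of the group layout described above, here is a minimal Python sketch that applies the pattern to a few addresses. The sample addresses and the names `ADDRESS_RE` and `split_address` are invented for this example, and `re.IGNORECASE` is an assumption carried over from the `i` flag in the question's regex. Python's `re` has no `\h`, so the sketch keeps `\s` and feeds the pattern one line at a time.

```python
import re

# The pattern from this answer, split over three string literals for readability.
ADDRESS_RE = re.compile(
    r"^(.*?)"                                         # Group 1: street name (lazy)
    r"(?:\s+(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b))?"  # Group 2: optional street number
    r"(?:\s+(\d{4,5})(?:\s+(.*))?)?$",                # Groups 3 and 4: optional zip code and town
    re.IGNORECASE,                                    # assumption: mirrors the question's i flag
)


def split_address(line):
    """Return (street, number, zipcode, town); missing parts are None."""
    match = ADDRESS_RE.match(line.strip())
    return match.groups() if match else None


# Hypothetical sample data, not the test strings from the question.
samples = [
    "Hauptstraße 12 12345 Berlin",
    "Hauptstraße 12a 1234 Wien",
    "Hauptstraße 12",
    "Am Markt",
]

for sample in samples:
    print(split_address(sample))
```

Because group 1 is lazy and every later group is optional, the engine grows the street name one character at a time until the rest of the line can be consumed by the number, zip code and town groups.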

Please note that the (?:\s+(.*))? optional group must be inside the (?:\s+(\d{4,5})...)? group to work.
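
To see why the nesting matters, the following Python sketch compares the nested layout with a hypothetical "flat" variant in which the town group sits next to the zip code group instead of inside it. The names `NESTED` and `FLAT`, the flat variant itself, and the sample address are assumptions made for illustration only.

```python
import re

# Street number sub-pattern shared by both variants.
NUMBER = r"(\d+(?:\s*[-|+\/]\s*\d+)*\s*[a-z]?\b)"

# Town group nested inside the zip code group, as in the answer.
NESTED = re.compile(
    r"^(.*?)(?:\s+" + NUMBER + r")?(?:\s+(\d{4,5})(?:\s+(.*))?)?$",
    re.IGNORECASE,
)

# Hypothetical flat variant: the town group is optional on its own.
FLAT = re.compile(
    r"^(.*?)(?:\s+" + NUMBER + r")?(?:\s+(\d{4,5}))?(?:\s+(.*))?$",
    re.IGNORECASE,
)

line = "Am Markt"  # invented sample: a two-word street name with nothing after it

print(NESTED.match(line).groups())  # the whole line stays in group 1
print(FLAT.match(line).groups())    # "Markt" is captured as the town
```

In the flat variant the lazy street group can stop after the first word, because the standalone town group is free to swallow everything after the next whitespace. Nesting the town inside the zip code group means a town can only be captured once a zip code has been found.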


2 Comments

Beautiful!!! 🙂
Wow thank you for the detailed description!! This helped me alot! 🙂

It is difficult to parse addresses because they sit halfway between formatted text and natural language. Here is a pattern that tries to keep the number of optional parts as small as possible while still handling the examples given, without asking too much of the regex engine. To do this, I mainly rely on character classes, atomic groups, and a relatively accurate description of the street names. Obviously, the examples in the question cannot be representative of every real address, and characters may need to be added to or removed from the classes to deal with new cases. Nevertheless, the structure of this pattern is a good starting point.

~
^
(?<strasse> [\pL\d-]+ \.? (?> \h+ [\pL\d-]+ \.? )*? ) \h*
(?<nummer> \b (?> \d+ | [-+/\h]+ | [a-z] \b )*? )
(?: \h+ (?<plz> \d{4,5} )
    \h+ (?<stadt> .+ ) )?
$
~mxui

demo

Note that in the above link you can also see a previous version of this pattern with a more accurate description of the street number (a bit more efficient but longer).
