1

I'm trying to validate, if zip code for many different countries which stored in a table are in a correct format, as an example:

ZIP COUNTRY_CODE
1033 SC NL
60593 DE

To do that I have a separate DF with country code and regular expressions as a pattern for every zip code.

REGEX COUNTRY
\d{4}[ ]?[A-Z]{2} NL
\d{5} DE

I'm trying to merge this to tables based on a coutry code and then create which indicates as True or False if the zip code based on regex is correct.

Here is my currecnt code:

df_merged = pd.merge(regex_df, zip_df,  left_on = 'CODE',   right_on= 'COUNTRY_CODE')
df_merged['zip_correct'] = df_mergedf.CODE_y.str.contains(df_merged.REGEX.str, regex= True, na=False)

Hovewer I'm etting only false results since pandas is cheking the regex pattern in every row. How could I limit it to check it row by row?

Expected output:

ZIP COUNTRY_CODE ZIP_CORRECT
1033 SC NL TRUE
60593 DE TRUE
6059TT DE FALSE

Could you please help?

2 Answers 2

2

I assume you might have many countries. You can apply re.match with the matching country regex on each row:

import re

s = regex_df.set_index('COUNTRY')['REGEX']

df['ZIP_CORRECT'] = df.apply(lambda x: bool(re.match('%s$' % s[x['COUNTRY_CODE']], x['ZIP'])), axis=1)

output:

       ZIP COUNTRY_CODE  ZIP_CORRECT
0  1033 SC           NL         True
1    60593           DE         True
2   6059TT           DE        False
Sign up to request clarification or add additional context in comments.

Comments

1

You can use

import pandas as pd
import re

zip_df = pd.DataFrame({"ZIP":["1033 SC", "60593"], "COUNTRY_CODE": ["NL","DE"]})
regex_df = pd.DataFrame({"REGEX":[r"\d{4}[ ]?[A-Z]{2}", "\d{5}"], "COUNTRY": ["NL","DE"]})

d = dict(regex_df[['COUNTRY', 'REGEX']].to_numpy())
zip_df['ZIP_CORRECT'] = zip_df.apply(lambda x: bool(re.fullmatch(d[x['COUNTRY_CODE']], x['ZIP'])), axis=1)

The d dictionary will look like {'NL': '\\d{4}[ ]?[A-Z]{2}', 'DE': '\\d{5}'}, and within the .apply(), each ZIP value from the zip_df dataframe is checked against the appropriate regex pattern from the dictionary as it is obtained form the d dictionary using the zip_df COUNTRY_CODE value as a key.

Note that re.fullmatch only returns True if the regex pattern fully matches the string, i.e. matches the entire string.

Test in Pandas (copy/pasted):

>>> import pandas as pd
>>> import re
>>> zip_df = pd.DataFrame({"ZIP":["1033 SC", "60593", "6059TT"], "COUNTRY_CODE": ["NL","DE","DE"]})
>>> regex_df = pd.DataFrame({"REGEX":[r"\d{4}[ ]?[A-Z]{2}", "\d{5}"], "COUNTRY": ["NL","DE"]})
>>> d = dict(regex_df[['COUNTRY', 'REGEX']].to_numpy())
>>> zip_df['ZIP_CORRECT'] = zip_df.apply(lambda x: bool(re.fullmatch(d[x['COUNTRY_CODE']], x['ZIP'])), axis=1)
>>> zip_df
       ZIP COUNTRY_CODE  ZIP_CORRECT
0  1033 SC           NL         True
1    60593           DE         True
2   6059TT           DE        False
>>> 

8 Comments

nice, I forgot about re.fullmatch ;)
@mozway Yeah, it is convenient here. Something is happening to SO editing feature, I have to confirm every edit today :(
me too, and then it shows a blank post until I refresh, hope it will be fixed soon ;)
@WiktorStribiżew Q2 : why going through a dict? Would it not be simpler to regex_df.set_index('COUNTRY') then in the lambda get the regex via regex_df['REGEX'][x['COUNTRY_CODE']]
@hpchavaz Because it works. Why not?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.