Here is an extract of my DataFrame:

import numpy as np
import pandas as pd

data = [
    ['Citroën Amillis', '20 Za Des Baliveaux - 77120 Amillis', '77120', 'ok'],
    ['Relat Paris 9e', 'Métro Opéra - 75009 Paris 9e', 'Paris', 'error'],
    ['Macif Avon', '49 Av Franklin Roosevelt - 77210 Avon', '77210', 'ok'],
    ['Atac La Chapelle-la-Reine', 'Za Rue De L\'avenir - 77760 La Chapelle-la-Reine', 'La', 'error'],
    ['Société Générale La Ferté-Gaucher', '42 Rue De Paris - 77320 La Ferté-Gaucher', 'La', 'error']
]

df = pd.DataFrame(data, columns=['nom_magasin', 'adresse', 'code_postal', 'is_code_postal'])

df

As you can see, there are mistakes in my DataFrame. For some addresses, especially when the city name is made up of several words (e.g. "La Chapelle-la-Reine"), the "code_postal" column is wrong.

What I want to do is the following: if the "is_code_postal" column is "error", replace "code_postal" with the postal code extracted by regex from the "adresse" column.

I can't find the solution. To flag the bad rows I tried this: df['is_code_postal'] = np.where(df.code_postal.str.match('^[a-zA-Z]'), 'error', 'ok'). At first I was thinking of making all the changes within the same function, but I'm missing something.

The important thing is that my DataFrame is fairly large (more than 250K rows), so I'd like an efficient solution.
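For reference, the flagging step can also be sketched the other way round, by testing for what a valid value looks like. This assumes a valid "code_postal" is exactly five digits, and that `Series.str.fullmatch` is available (pandas 1.1 or later); the two-row frame is only illustrative.

```python
import numpy as np
import pandas as pd

# Two illustrative rows taken from the extract above.
df = pd.DataFrame({
    'adresse': ['20 Za Des Baliveaux - 77120 Amillis',
                'Métro Opéra - 75009 Paris 9e'],
    'code_postal': ['77120', 'Paris'],
})

# Assumption: a valid code_postal is exactly five digits.
# na=False treats missing values as invalid rather than propagating NaN.
is_valid = df['code_postal'].str.fullmatch(r'\d{5}', na=False)

# Vectorized flag, no Python-level loop over the rows.
df['is_code_postal'] = np.where(is_valid, 'ok', 'error')
```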

Do you guys have any idea?

  • Does df['adresse'].str.extract(r'(\d{5})') work for you? Commented Jan 13, 2020 at 16:39
  • Like this? df['is_code_postal'] = np.where(df.code_postal.str.match('^[a-zA-Z]'), df['adresse'].str.extract(r'(\d{5})'), 'ok') @QuangHoang Commented Jan 13, 2020 at 16:48
  • df['adresse'].str.extract(r'(\d{5})') gives you the postal code. You can compare those to df['code_postal']. Commented Jan 13, 2020 at 16:50
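The comparison suggested in the last comment might look like the sketch below. It assumes the first run of five digits in "adresse" is the postal code (a five-digit street number would break that assumption), and the names `extracted` and `mismatch` are illustrative, not from the thread.

```python
import pandas as pd

# Three illustrative rows taken from the question's extract.
df = pd.DataFrame({
    'adresse': ['20 Za Des Baliveaux - 77120 Amillis',
                'Métro Opéra - 75009 Paris 9e',
                '49 Av Franklin Roosevelt - 77210 Avon'],
    'code_postal': ['77120', 'Paris', '77210'],
})

# expand=False returns a Series (one capture group) instead of a DataFrame.
extracted = df['adresse'].str.extract(r'(\d{5})', expand=False)

# True wherever the stored code differs from the one found in the address.
mismatch = df['code_postal'] != extracted
```

Rows where `mismatch` is True are the ones whose "code_postal" disagrees with the address.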

1 Answer


You could just ignore the code_postal and extract it directly from 'adresse', using the code from Quang:

df['code_postal'] = df['adresse'].str.extract(r'(\d{5})', expand=False)

2 Comments

This would overwrite all postal codes. How would you apply it only to the incorrect ones?
You can select only those rows via df.loc[df.is_code_postal == 'error', 'code_postal'] = df.adresse.str.extract(r'- (\d{5})', expand=False)
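A minimal sketch of that masked update on the sample rows from the question. It assumes the postal code is always the five digits that follow " - " in "adresse", and it passes `expand=False` so that `str.extract` returns a Series that aligns on the index; the `mask` variable is illustrative.

```python
import pandas as pd

data = [
    ['Citroën Amillis', '20 Za Des Baliveaux - 77120 Amillis', '77120', 'ok'],
    ['Relat Paris 9e', 'Métro Opéra - 75009 Paris 9e', 'Paris', 'error'],
    ['Macif Avon', '49 Av Franklin Roosevelt - 77210 Avon', '77210', 'ok'],
    ['Atac La Chapelle-la-Reine', "Za Rue De L'avenir - 77760 La Chapelle-la-Reine", 'La', 'error'],
    ['Société Générale La Ferté-Gaucher', '42 Rue De Paris - 77320 La Ferté-Gaucher', 'La', 'error'],
]
df = pd.DataFrame(data, columns=['nom_magasin', 'adresse', 'code_postal', 'is_code_postal'])

# Boolean mask selecting only the rows flagged as wrong.
mask = df['is_code_postal'] == 'error'

# Run the regex only on the flagged rows and write the result back to them.
# Assumption: the code is the five digits right after " - ".
df.loc[mask, 'code_postal'] = (
    df.loc[mask, 'adresse'].str.extract(r'- (\d{5})', expand=False)
)
```

Rows flagged 'ok' are left untouched, and both the mask and the extraction are vectorized, which should matter on a frame of 250K rows.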
