Replacing a column value by another column value based on regex - Python

Question

This is an extract of my DataFrame:

import pandas as pd

data = [
    ['Citroën Amillis', '20 Za Des Baliveaux - 77120 Amillis', '77120', 'ok'],
    ['Relat Paris 9e', 'Métro Opéra - 75009 Paris 9e', 'Paris', 'error'],
    ['Macif Avon', '49 Av Franklin Roosevelt - 77210 Avon', '77210', 'ok'],
    ['Atac La Chapelle-la-Reine', 'Za Rue De L\'avenir - 77760 La Chapelle-la-Reine', 'La', 'error'],
    ['Société Générale La Ferté-Gaucher', '42 Rue De Paris - 77320 La Ferté-Gaucher', 'La', 'error']
]

df = pd.DataFrame(data, columns=['nom_magasin', 'adresse', 'code_postal', 'is_code_postal'])

df

As you can see, there are mistakes in my DataFrame. For some addresses, especially when the city name is made up of several words (e.g. "La Chapelle-la-Reine"), the column "code_postal" is wrong.

What I'm looking to do is the following: if the column "is_code_postal" is "error", replace "code_postal" with the postal code extracted by regex from the column "adresse".

I can't find the solution. So far I've tried this to flag the bad rows: df['is_code_postal'] = np.where(df.code_postal.str.match('^[a-zA-Z]'), 'error', 'ok'). At first I was thinking about doing all the changes within the same function, but I'm missing something.

The important thing is that my DataFrame is fairly large (more than 250K rows), so I'd like an efficient solution.

Do you guys have any idea?

Doing this? df['is_code_postal'] = np.where(df.code_postal.str.match('^[a-zA-Z]'), df['adresse'].str.extract(r'(\d{5})'), 'ok') @QuangHoang
– Grégoire de Kermel, Commented Jan 13, 2020 at 16:48
df['adresse'].str.extract(r'(\d{5})') gives you the postal code. You can compare those to df['code_postal'].
– Quang Hoang, Commented Jan 13, 2020 at 16:50
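
The comparison suggested in the comment above could look roughly like the sketch below. It uses a shortened copy of the sample data, and the variable name `extracted` is only illustrative. It assumes the first run of five digits in "adresse" is the postal code, which would not hold for an address with a five-digit street number.

```python
import pandas as pd

# Shortened copy of the sample data from the question
df = pd.DataFrame({
    'adresse': [
        '20 Za Des Baliveaux - 77120 Amillis',
        'Métro Opéra - 75009 Paris 9e',
        "Za Rue De L'avenir - 77760 La Chapelle-la-Reine",
    ],
    'code_postal': ['77120', 'Paris', 'La'],
})

# expand=False returns a Series instead of a one-column DataFrame
extracted = df['adresse'].str.extract(r'(\d{5})', expand=False)

# Flag rows whose stored value differs from the code found in the address
matches = df['code_postal'] == extracted
df['is_code_postal'] = matches.map({True: 'ok', False: 'error'})
```

Both steps are vectorized string and comparison operations, so no Python-level loop over the rows is involved.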

gdnaes · Accepted Answer · 2020-01-13 16:52:13Z

2

You could just ignore the code_postal and extract it directly from 'adresse', using the code from Quang:

df['code_postal'] = df['adresse'].str.extract(r'(\d{5})', expand=False)


2 Comments

Umar.H Over a year ago

This would overwrite all postal codes. How would you apply this to only the incorrect ones?

Vivian L. Over a year ago

You can select only those rows via df.loc[df.is_code_postal == 'error', 'code_postal'] = df.adresse.str.extract(r'- (\d{5})', expand=False)
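
Put together with the sample data from the question, the row-selective replacement from this comment might be sketched as follows. The name `mask` is illustrative, and the pattern assumes the postal code always follows " - " in "adresse", as it does in the sample rows.

```python
import pandas as pd

data = [
    ['Citroën Amillis', '20 Za Des Baliveaux - 77120 Amillis', '77120', 'ok'],
    ['Relat Paris 9e', 'Métro Opéra - 75009 Paris 9e', 'Paris', 'error'],
    ['Macif Avon', '49 Av Franklin Roosevelt - 77210 Avon', '77210', 'ok'],
    ['Atac La Chapelle-la-Reine', "Za Rue De L'avenir - 77760 La Chapelle-la-Reine", 'La', 'error'],
    ['Société Générale La Ferté-Gaucher', '42 Rue De Paris - 77320 La Ferté-Gaucher', 'La', 'error'],
]
df = pd.DataFrame(data, columns=['nom_magasin', 'adresse', 'code_postal', 'is_code_postal'])

# Boolean mask selecting only the rows flagged as wrong
mask = df['is_code_postal'] == 'error'

# Extract the five digits after "- " for the flagged rows only;
# expand=False yields a Series that aligns with the masked rows by index
df.loc[mask, 'code_postal'] = df.loc[mask, 'adresse'].str.extract(r'- (\d{5})', expand=False)
```

Rows marked "ok" are left untouched, and the regex only runs on the flagged subset, which may help on a frame with 250K+ rows.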
