Replace string pattern in a data frame

Question

suppose I have a data frame:

df = pd.DataFrame({'Country': ['Aruba', 'lorem Andorra ipsum', 'Afgahnistan', 'Bla Yemen, Rep.', 'South Africa'],
 'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})

How can I replace all cells which include the pattern Yemen with the string Yemen? The result should be:

df = pd.DataFrame({'Country': ['Aruba', 'lorem Andorra ipsum', 'Afgahnistan', 'Yemen', 'South Africa'],
 'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})

In the next step, is it possible to replace all cells including Andorra with Andorra and all cells including Yemen with Yemen in one step, using lists or dictionaries?

The result should be:

df = pd.DataFrame({'Country': ['Aruba', 'Andorra', 'Afgahnistan', 'Yemen', 'South Africa'],
 'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})

I tried for example

df.replace(regex='lorem Andorra ipsum', value='Andorra ')

which works for sure, since it looks for lorem Andorra ipsum. But that is a too specific approach. I further tried different reg expressions as

df.replace(regex=r'^Andorra.$', value='Andorra'). But it didn't work.

I appreciate any help in advance!

Maybe just df['Country'][df['Country'].str.contains(r"\bYemen\b", case=False)] = 'Yemen'? — Wiktor Stribiżew
– Wiktor Stribiżew, Commented Jan 22, 2021 at 14:47

al_qaysee · Accepted Answer · 2021-01-22 13:37:43Z

3

first of all, put all patterns in a single array then loop this array, on top of another loop that is looping in the df, to search for the pattern and replace.

Here is a code:

patterns = ["Yemen","Andorra"]
for pattern in patterns: 
  for index, row in df.iterrows(): 
    if row[0].lower().find(pattern.lower()) != -1 :  
        df['Country'][index] = pattern
        break
print(df)

And the reult:

        Country             Geographic region
0         Aruba   Latin America and Caribbean
1       Andorra       Europe and Central Asia
2   Afgahnistan                    South Asia
3         Yemen  Middle East and North Africa
4  South Africa            Sub Saharan Africa

answered Jan 22, 2021 at 13:37

al_qaysee

811 silver badge6 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

chris_tri Over a year ago

Thank you very much for the answer. It works perfectly and is quite a straight forward solution and easy to understand!

Wiktor Stribiżew · Accepted Answer · 2021-01-22 15:00:56Z

1

You may use this approach if the number of words is not too huge:

import pandas as pd
df = pd.DataFrame({'Country': ['Aruba', 'lorem Andorra ipsum', 'Afgahnistan', 'Bla Yemen, Rep.', 'South Africa'],
 'Geographic region': ['Latin America and Caribbean', 'Europe and Central Asia', 'South Asia', 'Middle East and North Africa', 'Sub Saharan Africa']})
l = ['Yemen', 'Andorra']
>>> df['Country'].str.replace(fr"(?s).*?\b({'|'.join(sorted(l, key=len, reverse=True))})\b.*", r"\1")
0           Aruba
1         Andorra
2     Afgahnistan
3           Yemen
4    South Africa
Name: Country, dtype: object

The fr"(?s).*?\b({'|'.join(sorted(l, key=len, reverse=True))})\b.* code creates a regex using the l list strings as alternation in the capturing group (with ID 1) and looks like

(?s).*?\b(Andorra|Yemen)\b.*

See the regex demo. Note the longer country names will be put first, so as to match the longest country name from country names sharing the same prefix.

Details:

(?s) - a re.DOTALL inline modifier, makes . match line break chars, too
.*? - any 0 or more chars, as few as possible
\b - word boundary
(Andorra|Yemen) - either Andorra or Yemen string
\b - word boundary
.* - the rest of the string.

edited Jan 22, 2021 at 15:00

answered Jan 22, 2021 at 14:55

Wiktor Stribiżew

631k41 gold badges502 silver badges633 bronze badges

1 Comment

chris_tri Over a year ago

Thanks a lot for the answer and especially the regex explanation. They help a lot! Also on different exercises!

Collectives™ on Stack Overflow

Replace string pattern in a data frame

2 Answers 2

1 Comment

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related