Replace only last occurrence of column value in DataFrame

Question

I have a DataFrame with a Company column.

Company
-------------------------------
Tundra Corporation Art Limited
Desert Networks Incorporated
Mount Yellowhive Security Corp
Carter, Rath and Mueller Limited (USD/AC)
Barrows corporation /PACIFIC
Corporation, Mounted Security

I have a dictionary of regexes to normalize the company entities (pattern; replacement).

(^|\s)corporation(\s|$); Corp 
(^|\s)Limited(\s|$); LTD 
(^|\s)Incorporated(\s|$); INC 
...

I need to normalize only the last occurrence. This is my desired output.

Company
-------------------------------
Tundra Corporation Art LTD
Desert Networks INC
Mount Yellowhive Security Corp
Carter, Rath and Mueller LTD (USD/AC)
Barrows Corp /PACIFIC
Corp, Mounted Security

(For Tundra Corporation Art Limited, only Limited should be normalized, not Corporation.)

My code:

for k, v in entity_dict.items():
    df['Company'].replace(regex=True, inplace=True, to_replace=re.compile(k,re.I), value=v)

Is it possible to change only the last occurrence of an entity (do I need to change my regex)?

jezrael · Accepted Answer · 2019-04-04 10:18:46Z

5

Change (\s|$) to ($) so that each pattern matches only at the end of the string:

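import re

# Raw strings keep \s intact; $ anchors each pattern to the end of the string
entity_dict = {r'(^|\s)corporation($)': ' Corp',
               r'(^|\s)Limited($)': ' LTD',
               r'(^|\s)Incorporated($)': ' INC'}

for k, v in entity_dict.items():
    # Assign the result back instead of relying on inplace=True on a column selection
    df['Company'] = df['Company'].replace(to_replace=re.compile(k, re.I), value=v, regex=True)

print(df)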
                          Company
0      Tundra Corporation Art LTD
1             Desert Networks INC
2  Mount Yellowhive Security Corp
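
For reference, a self-contained sketch of the end-anchored idea run against all six sample rows from the question. The `normalize_at_end` helper is a hypothetical name, and it uses `Series.str.replace` rather than the `Series.replace` loop above, which is a swapped-in technique chosen only to keep the example short. Rows where the entity is not at the very end of the string are left untouched, which is the limitation raised in the comments.

```python
import re
import pandas as pd

df = pd.DataFrame({'Company': [
    'Tundra Corporation Art Limited',
    'Desert Networks Incorporated',
    'Mount Yellowhive Security Corp',
    'Carter, Rath and Mueller Limited (USD/AC)',
    'Barrows corporation /PACIFIC',
    'Corporation, Mounted Security',
]})

# Each pattern ends in $, so it can only match at the end of the string.
entity_dict = {r'(^|\s)corporation$': ' Corp',
               r'(^|\s)Limited$': ' LTD',
               r'(^|\s)Incorporated$': ' INC'}

def normalize_at_end(series, mapping):
    """Apply every end-anchored pattern in turn, ignoring case (hypothetical helper)."""
    out = series
    for pattern, repl in mapping.items():
        out = out.str.replace(pattern, repl, flags=re.I, regex=True)
    return out

result = normalize_at_end(df['Company'], entity_dict)
print(result.tolist())
```

Only the first two rows change here; the last three keep `Limited`, `corporation` and `Corporation` because those words are followed by other text.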

EDIT: You can simplify the dictionary to plain words (no regex) and build a lowercase copy of it for lookups. Use Series.str.findall to collect all matches, take the last one by indexing with str[-1], map it through the lowercase dictionary with Series.map, and finally do the replacement in a list comprehension:

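import re

entity_dict = {'corporation': 'Corp',
               'Limited': 'LTD',
               'Incorporated': 'INC'}

# Lowercase keys so a match can be looked up regardless of its case
lower = {k.lower(): v for k, v in entity_dict.items()}
# s1: last matched word per row ('' if nothing matched); s2: its replacement
s1 = df['Company'].str.findall('|'.join(lower.keys()), flags=re.I).str[-1].fillna('')
s2 = s1.str.lower().map(lower).fillna('')

# Replace the matched text b with c in each company name a
df['Company'] = [a.replace(b, c) for a, b, c in zip(df['Company'], s1, s2)]
print(df)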
                                 Company
0             Tundra Corporation Art LTD
1                    Desert Networks INC
2         Mount Yellowhive Security Corp
3  Carter, Rath and Mueller LTD (USD/AC)
4                  Barrows Corp /PACIFIC
5                 Corp, Mounted Security
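
One caveat: `a.replace(b, c)` swaps every exact copy of the matched text, so a name such as `Limited Foods Limited` would be changed in both places. A minimal sketch of a stricter variant, assuming whole-word matching is acceptable: it collects the matches of one compiled pattern and splices the replacement in at the last match only. The `replace_last` helper and the extra sample string are illustrative and not part of the original answer.

```python
import re
import pandas as pd

entity_dict = {'corporation': 'Corp',
               'Limited': 'LTD',
               'Incorporated': 'INC'}

lower = {k.lower(): v for k, v in entity_dict.items()}
# \b restricts matches to whole words, so 'corporation' is not found
# inside a longer word such as 'Incorporation'.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, lower)) + r')\b', flags=re.I)

def replace_last(text):
    """Replace only the last entity match in text (hypothetical helper)."""
    matches = list(pattern.finditer(text))
    if not matches:
        return text
    last = matches[-1]
    # Splice by position so earlier, identical words are left alone.
    return text[:last.start()] + lower[last.group().lower()] + text[last.end():]

companies = pd.Series([
    'Tundra Corporation Art Limited',
    'Carter, Rath and Mueller Limited (USD/AC)',
    'Barrows corporation /PACIFIC',
    'Mount Yellowhive Security Corp',
    'Limited Foods Limited',  # made-up row with a repeated entity
])
normalized = companies.map(replace_last)
print(normalized.tolist())
```

Splicing by match position rather than by text is what keeps the first `Limited` in the made-up row intact.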

edited Apr 4, 2019 at 10:18

answered Apr 4, 2019 at 6:58

jezrael


11 Comments

John Doe Over a year ago

I can't do that because the data is sometimes a bit messy and the entity is not always at the end of the string as you would expect. I'm trying to clean up the data, which is why I want to normalize only the last occurrence. That gives me the desired result for now.

jezrael Over a year ago

@JohnDoe - hmmm, is it possible to change the sample data?

jezrael Over a year ago

@JohnDoe - I am confused now: why is only the first one, Tundra Corporation, not changed?

jezrael Over a year ago

@JohnDoe - So if there are multiple matches, only the last one should be changed? E.g. Desert Limited Networks Incorporated should become Desert Limited Networks INC?

jezrael Over a year ago

@JohnDoe - Because str.findall is called with flags=re.I, it matches the keys regardless of case, so either dictionary can be used to build the pattern: s1 = df['Company'].str.findall('|'.join(lower.keys()), flags=re.I).str[-1].fillna('') gives the same result as s1 = df['Company'].str.findall('|'.join(entity_dict.keys()), flags=re.I).str[-1].fillna(''). map, however, needs an exact key match, so the lowercase dictionary is used there.
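
A small sketch of the point made in this comment, using two of the sample rows: the case-insensitive `findall` returns the text as it appears in the data, so looking it up directly in `entity_dict` can miss, while lowercasing first and using the lowercase dictionary always hits. The variable names `last`, `direct` and `via_lower` are illustrative.

```python
import re
import pandas as pd

entity_dict = {'corporation': 'Corp',
               'Limited': 'LTD',
               'Incorporated': 'INC'}
lower = {k.lower(): v for k, v in entity_dict.items()}

s = pd.Series(['Corporation, Mounted Security', 'Barrows corporation /PACIFIC'])

# flags=re.I makes the pattern match any casing; the matched text keeps its original case.
last = s.str.findall('|'.join(entity_dict), flags=re.I).str[-1]

# Exact-key lookup: 'Corporation' is not a key of entity_dict, so the first row becomes NaN.
direct = last.map(entity_dict)

# Lowercasing the match first lets every row find its key in the lowercase dictionary.
via_lower = last.str.lower().map(lower)

print(last.tolist())
print(direct.isna().tolist())
print(via_lower.tolist())
```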
