
I have a DataFrame with a Company column:

Company
-------------------------------                                                           
Tundra Corporation Art Limited
Desert Networks Incorporated
Mount Yellowhive Security Corp
Carter, Rath and Mueller Limited (USD/AC)
Barrows corporation /PACIFIC
Corporation, Mounted Security

I have a dictionary of regexes that normalize the company entities (pattern; replacement):

(^|\s)corporation(\s|$); Corp 
(^|\s)Limited(\s|$); LTD 
(^|\s)Incorporated(\s|$); INC 
...

I need to normalize only the last occurrence of an entity in each name. This is my desired output:

Company
-------------------------------                                                           
Tundra Corporation Art LTD
Desert Networks INC
Mount Yellowhive Security Corp
Carter, Rath and Mueller LTD (USD/AC)
Barrows Corp /PACIFIC
Corp, Mounted Security

(For "Tundra Corporation Art Limited", only Limited should be normalized, not Corporation.)

My code:

for k, v in entity_dict.items():
    df['Company'].replace(regex=True, inplace=True, to_replace=re.compile(k,re.I), value=v)

Is it possible to change only the last occurrence of an entity? Do I need to change my regex?


1 Answer


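Change (\s|$) to ($) so that each pattern matches only at the end of the string. The replacement values start with a space because (^|\s) consumes the whitespace before the entity:

import re

# Raw strings keep \s from being read as a string escape.
entity_dict = {r'(^|\s)corporation($)': ' Corp',
               r'(^|\s)Limited($)': ' LTD',
               r'(^|\s)Incorporated($)': ' INC'}

for k, v in entity_dict.items():
    # Assign back instead of using inplace=True on a column selection.
    df['Company'] = df['Company'].replace(regex=True, to_replace=re.compile(k, re.I), value=v)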

print (df)
                          Company
0      Tundra Corporation Art LTD
1             Desert Networks INC
2  Mount Yellowhive Security Corp
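
As a quick check of what end-anchored patterns do and do not cover, here is a minimal sketch that uses plain re.sub rather than pandas. The helper name normalize_at_end is hypothetical and only for illustration. Names with trailing text after the entity are left untouched:

```python
import re

# End-anchored patterns: the entity must be the last word of the name.
entity_dict = {r'(^|\s)corporation($)': ' Corp',
               r'(^|\s)Limited($)': ' LTD',
               r'(^|\s)Incorporated($)': ' INC'}

def normalize_at_end(name):
    """Apply every end-anchored pattern, ignoring case (hypothetical helper)."""
    for pattern, repl in entity_dict.items():
        name = re.sub(pattern, repl, name, flags=re.I)
    return name

names = ['Tundra Corporation Art Limited',
         'Carter, Rath and Mueller Limited (USD/AC)',
         'Barrows corporation /PACIFIC']
results = [normalize_at_end(n) for n in names]
# Only the first name ends with an entity, so only it changes.
```

The last two names keep Limited and corporation as they are, because neither word sits at the very end of the string.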

EDIT: You can simplify the dictionary to plain strings instead of regexes. Build a lowercase-keyed copy of it, use Series.str.findall to collect every match, take the last match with str[-1], and look up its replacement in the lowercase dictionary with Series.map. Finally, do the replacement in a list comprehension:

entity_dict = {'corporation': 'Corp',
               'Limited': 'LTD',
               'Incorporated': 'INC'}

lower = {k.lower():v for k, v in entity_dict.items()}
s1 = df['Company'].str.findall('|'.join(lower.keys()), flags=re.I).str[-1].fillna('')
s2 = s1.str.lower().map(lower).fillna('')

df['Company'] = [a.replace(b, c) for a, b, c in zip(df['Company'], s1, s2)]
print (df)
                                 Company
0             Tundra Corporation Art LTD
1                    Desert Networks INC
2         Mount Yellowhive Security Corp
3  Carter, Rath and Mueller LTD (USD/AC)
4                  Barrows Corp /PACIFIC
5                 Corp, Mounted Security
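
An alternative sketch, not taken from the answer itself: a single regex with a negative lookahead can target only the last entity in each name. This also covers names in which the same entity appears twice, where str.replace in a list comprehension would change every occurrence of the matched text. The names last_entity and normalize_last are hypothetical, and the dictionary keys are assumed to be plain lowercase words.

```python
import re

# Assumed mapping: lowercase entity word -> normalized form.
entity_dict = {'corporation': 'Corp',
               'limited': 'LTD',
               'incorporated': 'INC'}

alternatives = '|'.join(re.escape(k) for k in entity_dict)

# Match an entity word only if no other entity word follows it later in the string.
last_entity = re.compile(
    rf'\b({alternatives})\b(?!.*\b(?:{alternatives})\b)',
    flags=re.I,
)

def normalize_last(name):
    """Replace only the last entity word in name (hypothetical helper)."""
    return last_entity.sub(lambda m: entity_dict[m.group(1).lower()], name)

companies = ['Tundra Corporation Art Limited',
             'Desert Networks Incorporated',
             'Mount Yellowhive Security Corp',
             'Carter, Rath and Mueller Limited (USD/AC)',
             'Barrows corporation /PACIFIC',
             'Corporation, Mounted Security']
normalized = [normalize_last(c) for c in companies]
```

On the DataFrame this could be applied with df['Company'] = df['Company'].map(normalize_last).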

11 Comments

I can't do that, because the data is sometimes a bit messy and the entity is not always at the end of the string as you would expect. I'm trying to clean up the data, which is why I want to normalize only the last occurrence. That gives me the desired result for now.
@JohnDoe - hmm, is it possible to change the sample data?
@JohnDoe - I am confused now: why is Corporation left unchanged only in the first row (Tundra Corporation Art Limited)?
@JohnDoe - So if multiple values match, only the last one needs to change? E.g. Desert Limited Networks Incorporated should become Desert Limited Networks INC?
@JohnDoe - Because str.findall matches case-insensitively here, either dictionary works for it: s1 = df['Company'].str.findall('|'.join(lower.keys()), flags=re.I).str[-1].fillna('') gives the same result as s1 = df['Company'].str.findall('|'.join(entity_dict.keys()), flags=re.I).str[-1].fillna(''). Series.map, however, needs an exact match, so the lowercase dictionary is used for it.