
I have two pandas dataframes and want to do a find and replace between them. In the 'current_title' column of the df_find dataframe, I want to search every row for any occurrence of the values in the 'keyword' column of the df_replace dataframe and, where one is found, replace it with the corresponding value from the 'keywordLength' column.

I have already been able to get rid of the loop over df_find, which would otherwise iterate over every row of that dataframe, by using str.replace, the vectorized form of the replace function.

Performance matters in my case, as both dataframes run into gigabytes. So I also want to get rid of the loop over df_replace and use a more efficient way of going through all of its rows.

import pandas as pd
df_find = pd.read_csv("input_find.csv")
df_replace = pd.read_csv("input_replace.csv")

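# For each keyword, replace every case-insensitive occurrence in current_title
for keyword, replacement in zip(df_replace.keyword, df_replace.keywordLength):
    df_find.current_title = df_find.current_title.str.replace(keyword, replacement, case=False)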

df_replace: this dataframe has the data we need for the find and replace.

keyword       keywordLength
IT Manager    ##10##
Sales Manager ##13##
IT Analyst    ##12##
Store Manager ##13##

df_find is where we need to do the transformation.

Before executing find and replace code:

current_title
I have been working here as a store manager since after I passed from college
I am sales manager and primarily work in the ASEAN region. My primary rolw is to bring new customers.
I initially joined as a IT analyst and because of my sheer drive and dedication, I was promoted to IT manager position within 3 years

After executing find and replace with the code above:

current_title
I have been working here as a ##13## since after I passed from college
I am ##13## and primarily work in the ASEAN region. My primary rolw is to bring new customers.
I initially joined as a ##12## and because of my sheer drive and dedication, I was promoted to ##10## position within 3 years

I will be ever grateful! Thanks

  • Would the matched values be complete matches, or maybe only substring matches? And what if there are multiple matches? Do you just take the first match? Commented May 21, 2017 at 15:16
  • All matches are replaced. It's complete match. Commented May 21, 2017 at 15:20
  • Look into regular expressions and re.sub. You could read the file as text, replace what you want to replace with regex and then open it as a csv. Commented May 21, 2017 at 15:23
  • str.replace is a vectorized implementation of re.sub. It performs the operation on the entire column instead of a single row. Commented May 21, 2017 at 15:30

1 Answer

If I understand you correctly, you should be able to do a relatively simple merge on your data sets (with a few other lines) and get the desired result.

Not having your data sets, I just made up my own. The following code could probably be a bit more elegant, but it gets you where you need to be in four lines, and most importantly - no looping:

Setup:

df_find = pd.DataFrame({
            'current_title':['a','a','b','c','b','c','b','a'],
            'other':['this','is','just','a','bunch','of','random','words']
        })

df_replace = pd.DataFrame({'keyword':['a','c'], 'keywordlength':['x','z']})

Code:

# This line is to simply re-sort at the end of the code.  Someone with more experience can probably bypass this step.
df_find['idx'] = df_find.index

# Merge together the two data sets based on matching the "current_title" and the "keyword"
dfx = df_find.merge(df_replace, left_on='current_title', right_on='keyword', how='outer').drop(columns='keyword')

# Now, copy the non-null "keywordlength" values to "current_title"
dfx.loc[dfx['keywordlength'].notnull(), 'current_title'] = dfx.loc[dfx['keywordlength'].notnull(), 'keywordlength']

# Clean up by dropping the unnecessary columns and resort based on the first line above.
df_find = dfx.sort_values('idx').drop(columns=['keywordlength', 'idx'])

Output:

  current_title   other
0             x    this
1             x      is
3             b    just
6             z       a
4             b   bunch
7             z      of
5             b  random
2             x   words
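
The merge above only helps when the whole cell equals a keyword. For the substring case shown in the question's before/after example, one possible alternative is to combine all keywords into a single regular expression and let a callable pick the replacement, so the column is scanned once instead of once per keyword. This is only a sketch: the name lookup is made up for illustration, it assumes the keywords are plain ASCII text (so lower-casing agrees with case-insensitive matching), and unlike the original loop it never re-scans text that has already been replaced.

```python
import re
import pandas as pd

# Sample data copied from the question
df_find = pd.DataFrame({"current_title": [
    "I have been working here as a store manager since after I passed from college",
    "I am sales manager and primarily work in the ASEAN region. My primary rolw is to bring new customers.",
    "I initially joined as a IT analyst and because of my sheer drive and dedication, I was promoted to IT manager position within 3 years",
]})
df_replace = pd.DataFrame({
    "keyword": ["IT Manager", "Sales Manager", "IT Analyst", "Store Manager"],
    "keywordLength": ["##10##", "##13##", "##12##", "##13##"],
})

# Hypothetical helper: lower-cased keyword -> replacement text
lookup = dict(zip(df_replace["keyword"].str.lower(), df_replace["keywordLength"]))

# Longest keywords first, so a longer keyword wins over one that is its prefix;
# re.escape keeps any regex metacharacters in the keywords literal.
keywords = sorted(lookup, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keywords), flags=re.IGNORECASE)

# One pass over the column; the callable maps each match to its replacement
df_find["current_title"] = df_find["current_title"].str.replace(
    pattern, lambda m: lookup[m.group(0).lower()], regex=True
)
```

Whether this is actually faster on multi-gigabyte data depends on the number of keywords and the regex engine's handling of a large alternation, so it is worth timing against the loop on a sample first.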

4 Comments

This isn't what I am looking for. I added an example of before and after to make it clearer. Thanks!
OK, that's exactly why I asked if they were complete matches or substring matches... Regardless, this highlights the importance of posting data sets.
It's a complete match. We are replacing the entire keyword with keywordLength; the code in the for loop does that. Not sure if you meant something else by 'complete match', but I went by its literal meaning.
Would you have any help for me?
