Merge two pandas dataframes in python with a common column and with condition on string

Question

I have two dfs in python :

df1

    folder_name   name
0   f1          aa
1   g1          bb


df2   
    name        icon
0   aa          i1
1   bb          i2
2   aadoq       i3
3   bbaddd      i4

Desired output:

df   
    folder_name  name    icon
0   f1           aa      i1
1   g1           bb      i2
2   f1           aadoq   i3
3   g1           bbaddd  i4

I tried merging them but it seemed wrong

pd.merge(df1,df2,on='name',how='right')

i am getting :

      folder_name  name    icon
  0   f1           aa      i1
  1   g1           bb      i2
  2   NAN          aadoq   i3
  3   NAN          bbaddd  i4

So if the prefix string in df2 name column matches any name column itme in df1 i want that folder name to be there for that name column in output

It's df2.merge(df1, on='name', how='left'), we've already been over this :-) — cs95
– cs95, Commented Dec 19, 2018 at 16:48
@coldspeed, that doesn't accomplish OPs purpose. That's how they are currently getting the undesired result. — G. Anderson
– G. Anderson, Commented Dec 19, 2018 at 16:51
@G.Anderson Ah, my bad, I glossed over this post and assumed that was their expected o/p. I will reopen. — cs95
– cs95, Commented Dec 19, 2018 at 16:51
Your example has inconsistent folder name ('d1' at the beginning and 'g1' rest of the example. Please, correct that) — Raj006
– Raj006, Commented Dec 19, 2018 at 17:00

Raj006 · Accepted Answer · 2018-12-19 19:34:40Z

1

Based on your question, here are the prerequisites I got

Python (I used version 3, but should not be much different for version 2)
Pandas data frames
The folder name is not restricted to the length=2

Here is my Python code. I used the Python regex module. I selected the "name" list from both dataframes, checked if any name in dataframe 1 matches the name in dataframe 2 (match in python regex means to match from the beginning of the string). Created a new list called MappedName based on these criteria where there is a match use the matched name value from dataframe 1 if not, use the name value from dataframe 2. Added this list as a new column to dataframe 2. Used the 'name' column from dataframe 1 and 'MappedName' column from dataframe 2 for merge criteria.

I added one extra data point to the dataframe 2 to show what happens when there is no match with the regular expressions.

from pandas import DataFrame
import re

df1=DataFrame({'folder_name':['f1','g1'],'name':['aa','bb']})
df2=DataFrame({'name':['aa','bb','aadoq','bbaddd','ding'],'icon':['i1','i2','i3','i4','i5']})
df1_name_list=df1['name']
df2_name_list=df2['name']
MappedName=[]
for name2 in df2_name_list:
    for name1 in df1_name_list:
        if re.match(name1,name2):
            name2=name1
            break
    MappedName.append(name2)
df2['MappedName']=MappedName
df3=df1.merge(df2,left_on='name',right_on='MappedName',how='right').drop(['name_x','MappedName'],axis=1)
df4=df1.merge(df2,left_on='name',right_on='MappedName').drop(['name_x','MappedName'],axis=1)

print ('\ndf1\n',df1)
print ('\ndf2\n',df2)
print ('\ndf3\n',df3)
print ('\ndf4\n',df4)

The result looks like below

df1
   folder_name name
0          f1   aa
1          g1   bb

df2
      name icon MappedName
0      aa   i1         aa
1      bb   i2         bb
2   aadoq   i3         aa
3  bbaddd   i4         bb
4    ding   i5       ding

df3
   folder_name  name_y icon
0          f1      aa   i1
1          f1   aadoq   i3
2          g1      bb   i2
3          g1  bbaddd   i4
4         NaN    ding   i5

df4
   folder_name  name_y icon
0          f1      aa   i1
1          f1   aadoq   i3
2          g1      bb   i2
3          g1  bbaddd   i4

If you want NaN when you merge and there is no match, use the df3 example otherwise, use df4. For large datasets (millions of data points), this might not be efficient code.

edited Dec 19, 2018 at 19:34

answered Dec 19, 2018 at 19:13

Raj006

5624 silver badges18 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

G. Anderson Over a year ago

Good solution! Much more robust than mine

Raj006 Over a year ago

@G.Anderson Your initial approach was factored into my solution. So, +1 for you.

ayush Over a year ago

sample_df = {'account':['a','b','a','c','a'], 'return_type':['CTR','WO','T','CTR','WO'], 'inception_date':['x','x','x','x','x']} df = pd.DataFrame(sample_df) df_ctr = df[df['return_type']=='CTR'] df2 = df[df['account'].isin(df_ctr.account)] duplicate_t = df2[df2['return_type']=='T'] df_duplicate_removed = df[~df.isin(duplicate_t)].dropna() df4

Raj006 Over a year ago

@ayush, what are you trying to say. It wasn't clear from your comment. Are you providing one more dataset as an example?

G. Anderson · Accepted Answer · 2018-12-19 17:11:19Z

1

I would explicitly add a column for prefix, merge, then remove the column after: (Note, this only works if all the data is as you posted, with a 2-character prefix)

df1=pd.DataFrame({'folder_name':['f1','d1'],'name':['aa','bb']})
df2=pd.DataFrame({'name':['aa','bb','aaq','bbdfg'],'icon':[1,2,3,4]})

df2['prefix']=df2['name'].str[:2]

    name    icon    prefix
0   aa      1       aa
1   bb      2       bb
2   aaq     3       aa
3   bbdfg   4       bb

df3=df2.merge(df1, left_on='prefix', right_on='name').drop(['name_y','prefix'], axis=1)

df3

    name_x  icon    folder_name
0   aa      1       f1
1   aaq     3       f1
2   bb      2       d1
3   bbdfg   4       d1

edited Dec 19, 2018 at 17:11

answered Dec 19, 2018 at 16:54

G. Anderson

5,9652 gold badges16 silver badges22 bronze badges

4 Comments

Raj006 Over a year ago

This would work if the length of the folder_name is always two. The question did not make any such assumption.

G. Anderson Over a year ago

The question did not state otherwise, and the provided data supports the code as written. If there are other concerns regarding the prefix, I believe it would be a separate question, or a matter of adapting this code to the actual condition. I will adit in the condition, though

ayush Over a year ago

suffix can be of any length

Raj006 Over a year ago

If suffix (assuming 'name' from dataframe 2) can be of any length, but the 'name' form dataframe is always going to be of length 2, this is a better solution than mine due to better performance and simplicity. The solution I created did not assume the length of the 'name' in dataframe 1 to be equal to 2.

Collectives™ on Stack Overflow

Merge two pandas dataframes in python with a common column and with condition on string

2 Answers 2

4 Comments

4 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

4 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related