How to merge 2 dataframes based on two columns via string containment

Question

My question is somewhat similar to this one: How to merge pandas on string contains?, but I need a different output and the problem itself is a bit more complex. So I have 2 dataframes similar to the ones below:

df1 = pd.DataFrame({'ref_name':['city-louisville','city-louisville','city-louisville', 'town-lexington','town-lexington','town-lexington'], 'un_name1':['CPU1','CPU2','GPU1','CPU1','CPU2','GPU1'], 'value1':[10,15,28,12,14,14]})

df2 = pd.DataFrame({'ref_name':['louisville','louisville','lexington','lexington'], 'un_name2':['CPU','GPU','CPU','GPU'], 'value2':[25,28,26,14]})

I need to join based on ref_name and un_name based on the substrings within them. They won't always be as clean as this, but I figured it made for a decent little example. So my desired output in this case would look something like this:

ref_name  |  un_name1  |  un_name2  | value1  |  value2
---------------------------------------------------------
louisville|  CPU1      |  CPU       |  10     |  25
louisville|  CPU2      |  CPU       |  15     |  25
louisville|  GPU1      |  GPU       |  28     |  28
lexington |  CPU1      |  CPU       |  12     |  26
lexington |  CPU2      |  CPU       |  14     |  26
lexington |  GPU1      |  GPU       |  14     |  14

Thanks in advance for any help on this!

Code Different · Accepted Answer · 2020-05-21 02:06:15Z

2

Here's the most generic version that I can think of. Performance may be a problem if your dataframes are big.

mask1 = df2['ref_name'].apply(lambda value: df1['ref_name'].str.contains(value))
mask2 = df2['un_name2'].apply(lambda value: df1['un_name1'].str.contains(value))
mask = (mask1 & mask2).stack().rename_axis(['index2', 'index1'])
mask = mask[mask].index.to_frame(False)

result = mask.merge(df2, left_on='index2', right_index=True) \
             .merge(df1, left_on='index1', right_index=True)

Result:

 index2  index1  ref_name_x un_name2  value2       ref_name_y un_name1  value1
      0       0  louisville      CPU      25  city-louisville     CPU1      10
      0       1  louisville      CPU      25  city-louisville     CPU2      15
      1       2  louisville      GPU      28  city-louisville     GPU1      28
      2       3   lexington      CPU      26   town-lexington     CPU1      12
      2       4   lexington      CPU      26   town-lexington     CPU2      14
      3       5   lexington      GPU      14   town-lexington     GPU1      14

Trimming / renaming the columns is an excercise left for the OP.

edited May 21, 2020 at 2:06

answered May 20, 2020 at 22:06

Code Different

93.4k16 gold badges154 silver badges175 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

obewanjacobi Over a year ago

Perfect, thank you!! Yes, trimming the excess columns is no problem at all, data lines up perfectly both in the above example and my messy dataset. Also feel like I should note that it didn't take too terribly long to run. For 1 dataframe with 1000 entries and the other with >6000 it took less than 10 seconds.

Collectives™ on Stack Overflow

How to merge 2 dataframes based on two columns via string containment

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related