2

My question is somewhat similar to this one: How to merge pandas on string contains?, but I need a different output and the problem itself is a bit more complex. So I have 2 dataframes similar to the ones below:

df1 = pd.DataFrame({'ref_name':['city-louisville','city-louisville','city-louisville', 'town-lexington','town-lexington','town-lexington'], 'un_name1':['CPU1','CPU2','GPU1','CPU1','CPU2','GPU1'], 'value1':[10,15,28,12,14,14]})

df2 = pd.DataFrame({'ref_name':['louisville','louisville','lexington','lexington'], 'un_name2':['CPU','GPU','CPU','GPU'], 'value2':[25,28,26,14]})

I need to join based on ref_name and un_name based on the substrings within them. They won't always be as clean as this, but I figured it made for a decent little example. So my desired output in this case would look something like this:

ref_name  |  un_name1  |  un_name2  | value1  |  value2
---------------------------------------------------------
louisville|  CPU1      |  CPU       |  10     |  25
louisville|  CPU2      |  CPU       |  15     |  25
louisville|  GPU1      |  GPU       |  28     |  28
lexington |  CPU1      |  CPU       |  12     |  26
lexington |  CPU2      |  CPU       |  14     |  26
lexington |  GPU1      |  GPU       |  14     |  14

Thanks in advance for any help on this!

1 Answer 1

2

Here's the most generic version that I can think of. Performance may be a problem if your dataframes are big.

mask1 = df2['ref_name'].apply(lambda value: df1['ref_name'].str.contains(value))
mask2 = df2['un_name2'].apply(lambda value: df1['un_name1'].str.contains(value))
mask = (mask1 & mask2).stack().rename_axis(['index2', 'index1'])
mask = mask[mask].index.to_frame(False)

result = mask.merge(df2, left_on='index2', right_index=True) \
             .merge(df1, left_on='index1', right_index=True)

Result:

 index2  index1  ref_name_x un_name2  value2       ref_name_y un_name1  value1
      0       0  louisville      CPU      25  city-louisville     CPU1      10
      0       1  louisville      CPU      25  city-louisville     CPU2      15
      1       2  louisville      GPU      28  city-louisville     GPU1      28
      2       3   lexington      CPU      26   town-lexington     CPU1      12
      2       4   lexington      CPU      26   town-lexington     CPU2      14
      3       5   lexington      GPU      14   town-lexington     GPU1      14

Trimming / renaming the columns is an excercise left for the OP.

Sign up to request clarification or add additional context in comments.

1 Comment

Perfect, thank you!! Yes, trimming the excess columns is no problem at all, data lines up perfectly both in the above example and my messy dataset. Also feel like I should note that it didn't take too terribly long to run. For 1 dataframe with 1000 entries and the other with >6000 it took less than 10 seconds.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.