I am trying to do the same as this answer, except that I'd like to ignore NaN in some cases. For instance:

#df1
     c1    c2    c3
0    a     b     1
1    a     c     2
2    a     nan   1
3    b     nan   3
4    c     d     1
5    d     e     3

#df2
     c1    c2    c4
0    a     nan   1
1    a     c     2
2    a     x     1
3    b     nan   3
4    z     y     2

#merged output based on [c1, c2], dropping instances 
#with `NaN` unless both dataframes have `NaN`.

     c1    c2    c3    c4
0    a     b     1     1   #c1,c2 from df1 because df2 has a nan in c2
1    a     c     2     2   #in both
2    a     x     1     1   #c1,c2 from df2 because df1 has a nan in c2
3    b     nan   3     3   #c1,c2 as found in both
4    c     d     1     nan #from df1
5    d     e     3     nan #from df1
6    z     y     nan   2   #from df2

NaN may appear in either c1 or c2; to keep the example simple, it only appears in c2 here.

I'm not sure what the cleanest way to do this is. I was thinking of merging on [c1, c2] and then looping over the rows with NaN, but that is not very direct. Do you see a better way to do it?
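For reference, the plain merge step mentioned above could look roughly like the sketch below. It rebuilds the two example frames and does an outer merge on [c1, c2]; `indicator=True` is used only to show which rows found a partner. Note that pandas pairs NaN keys with each other in a merge, so this gives 8 rows, not the 7 wanted: (a, NaN) from df1 is paired with (a, NaN) from df2 instead of with (a, x).

```python
import numpy as np
import pandas as pd

# The two example frames from the question.
df1 = pd.DataFrame({
    "c1": ["a", "a", "a", "b", "c", "d"],
    "c2": ["b", "c", np.nan, np.nan, "d", "e"],
    "c3": [1, 2, 1, 3, 1, 3],
})
df2 = pd.DataFrame({
    "c1": ["a", "a", "a", "b", "z"],
    "c2": [np.nan, "c", "x", np.nan, "y"],
    "c4": [1, 2, 1, 3, 2],
})

# Outer merge on both keys. indicator=True adds a `_merge` column telling
# whether a row came from df1 only, df2 only, or both.
merged = df1.merge(df2, on=["c1", "c2"], how="outer", indicator=True)

# Rows without a partner; these (plus the NaN rows) are the ones that
# would still need special handling afterwards.
unmatched = merged[merged["_merge"] != "both"]

print(merged)
print(unmatched)
```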

Edit - clarifying conditions
1. No duplicates are found anywhere.
2. No combination is performed between two rows if they both have values. c1 values may not be matched against c2 values, so column order must be respected.
3. When one of the two dataframes has a NaN in either c1 or c2, find the row in the other dataframe that has no full match on both c1 and c2, and combine with it. For instance:

  • (a,c) has a full match in both, so it needs no further discussion.
  • (a,b) is only in df1. No b is found in df2.c2. The only row in df2 with a known key and a NaN is row 0, so the two are combined. Note that order must be respected, which is why (a,b) from df1 cannot be combined with a df2 row that merely contains a b elsewhere, such as (b,nan).
  • (a,x) is only in df2. No x is found in df1.c2. The only row in df1 with one of the known keys and a NaN is the row with index 2.
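The three conditions can be read as a priority order for pairing rows, and a minimal sketch of that reading follows. Everything in it is an assumption made for illustration: the helper names (`has_nan`, `compatible`, `fill`, `nan_aware_merge`) are hypothetical, NaN is treated as a wildcard within its own column, and when several rows are compatible the first one in row order is taken, which the conditions above do not actually specify. It is a plain Python loop over rows, not a vectorized pandas solution.

```python
import numpy as np
import pandas as pd

KEYS = ["c1", "c2"]
df1 = pd.DataFrame({"c1": ["a", "a", "a", "b", "c", "d"],
                    "c2": ["b", "c", np.nan, np.nan, "d", "e"],
                    "c3": [1, 2, 1, 3, 1, 3]})
df2 = pd.DataFrame({"c1": ["a", "a", "a", "b", "z"],
                    "c2": [np.nan, "c", "x", np.nan, "y"],
                    "c4": [1, 2, 1, 3, 2]})


def has_nan(key):
    return any(pd.isna(v) for v in key)


def compatible(k1, k2):
    # Compare column by column (so c1 is never matched against c2);
    # a NaN on either side acts as a wildcard for that column.
    return all(pd.isna(a) or pd.isna(b) or a == b for a, b in zip(k1, k2))


def fill(k1, k2):
    # Keep the known value in each key column.
    return tuple(b if pd.isna(a) else a for a, b in zip(k1, k2))


# Pairing rules, tried in this order (assumed priority):
RULES = [
    lambda a, b: not has_nan(a) and a == b,                       # full match
    lambda a, b: has_nan(a) != has_nan(b) and compatible(a, b),   # NaN vs known
    lambda a, b: has_nan(a) and has_nan(b) and compatible(a, b),  # NaN vs NaN
]


def nan_aware_merge(left, right, keys):
    lrows, rrows = left.to_dict("records"), right.to_dict("records")
    lkeys = [tuple(r[k] for k in keys) for r in lrows]
    rkeys = [tuple(r[k] for k in keys) for r in rrows]
    used_l, used_r = [False] * len(lrows), [False] * len(rrows)
    out = []
    for rule in RULES:
        for i, a in enumerate(lkeys):
            if used_l[i]:
                continue
            for j, b in enumerate(rkeys):
                if not used_r[j] and rule(a, b):
                    used_l[i] = used_r[j] = True
                    merged_keys = dict(zip(keys, fill(a, b)))
                    out.append({**lrows[i], **rrows[j], **merged_keys})
                    break  # greedy: first compatible row wins
    # Rows that never found a partner are kept as they are.
    out += [r for r, used in zip(lrows, used_l) if not used]
    out += [r for r, used in zip(rrows, used_r) if not used]
    return pd.DataFrame(out).sort_values(keys, ignore_index=True)


result = nan_aware_merge(df1, df2, KEYS)
print(result)
```

On the example data this should give the seven rows of the expected output, with c3 and c4 as floats because of the NaN entries. With more than one compatible candidate per row the greedy choice may not be what is wanted, so the tie-breaking rule would need to be decided first.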
  • df1.combine_first(df2) Commented Jul 25, 2019 at 11:47
  • If your data is aligned as in your example data, you can simply use: df1['c2'] = df1['c2'].fillna(df2['c2']) Commented Jul 25, 2019 at 11:47
  • Let me know if your data is aligned as you show in your question. If this is not the case, neither my answer nor anky's will work. I will vote to reopen. Commented Jul 25, 2019 at 11:49
  • You have to test this yourself; the rows have to be aligned as you show in your example dataset. If a sort will fix this, that means your data has the same number of rows, plus the same number of keys: a, b etc. Commented Jul 25, 2019 at 11:59
  • Your edit to the question introduces some major additional problems. Your expected output no longer makes sense in a "merge" context: if you were using merge, both tables have an ('a', nan) row, so shouldn't those get paired up? I think you really need to clarify for yourself some restrictions on how this should work, otherwise I don't see you asking for "one solution". There are consistency issues. Simply put, why should the 3rd row of df2 get combined with the 3rd row of df1? The first row of df2 matches much better with df1. Commented Jul 25, 2019 at 12:39
