2

My problem is kind of a bit tricky (similar to sql merge/update), and not understanding how to fix: ( I am giving a small sample of the dataframes below)

I have two dataframes :

  dfOld 

     A   B  C  D  E
     x1  x2  g  h  r
     q1  q2  x  y  s
     t1  t2  h  j  u
     p1  p2  r  s  t

AND

 dfNew 

          A   B  C  D   E
          x1  x2  a  b  c
          s1  s2  p  q  r
          t1  t2  h  j  u
          q1  q2  x  y  z

We want to merge the dataframes with the following rule : ( we can think Col A & ColB as keys)

For any ColA & ColB combination if C/D/E are exact match then it takes value from any dataframe, however if any value has changed in Col C/D/E , it takes the value from new dataframe and if a new ColA/Col B combination is in DfNew then it takes those values and if the ColA/ColB combination does not exist in dfNew then it takes the value from dfOld:

So my OutPut should be like:

            A   B  C  D   E
            x1  x2  a  b  c
            q1  q2  x  y  z
            t1  t2  h  j  u
            p1  p2  r  s  t
            s1  s2  p  q  r

I was trying :

    mydfL = (df.merge(df1,indicator = True, how='left').loc[lambda x : x['_merge']!='both'])
    mydfR = (df1.merge(df,indicator = True, how='left').loc[lambda x : x['_merge']!='both'])


    dfO = pd.concat([mydfL,mydfR])

    dfO.drop("_merge", axis=1, inplace=True)

My output looks like: ( I kept the index for clarity)

            A   B  C  D  E
        0  x1  x2  a  b  c
        2  s1  s2  p  q  r
        3  q1  q2  x  y  z
        0  x1  x2  g  h  r
        2  q1  q2  x  y  s
        3  p1  p2  r  s  t

However, this output does not serve my purpose. First and foremost it does not include the totally identical row (between dfOld & dfnew) which consists of :

          t1  t2  h  j  u

and next it includes all the rows where for the ColA/Col x, y and q1, q2, where I just wanted the updated values in ColC/D/E in the new data frame ( dfNew). It includes data from both.

So can I get some help as to what am I missing and what may be a better and elegant way to do this. Thanks in advance.

1 Answer 1

1

You can use combine_first using A/B as temporary index:

out = (dfNew.set_index(['A', 'B'])
            .combine_first(dfOld.set_index(['A', 'B']))
            .reset_index()
      )
Sign up to request clarification or add additional context in comments.

2 Comments

Wanted to ask you though @mozway, for very big dataframes, can we do a similar thing using vaex or dask or similar?
@Stan I don't know, you should ask with the proper tags ;)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.