I have many DataFrames that I need to merge.

Let's say:

base: id  constraint
      1   'a'
      2   'b'
      3   'c'

df_1: id value constraint
      1  1     'a'
      2  2     'a'
      3  3     'a'

df_2: id value constraint
      1  1     'b'
      2  2     'b'
      3  3     'b'


df_3: id value constraint
      1  1     'c'
      2  2     'c'
      3  3     'c'

If I try and merge all of them (it'll be in a loop), I get:

a = pd.merge(base, df_1, on=['id', 'constraint'], how='left')
b = pd.merge(a, df_2, on=['id', 'constraint'], how='left')
c = pd.merge(b, df_3, on=['id', 'constraint'], how='left')

id constraint value_x  value_y  value
1  'a'        1        NaN      NaN
2  'b'        NaN      2        NaN
3  'c'        NaN      NaN      3

The desired output would be:

id constraint value
1  'a'        1 
2  'b'        2
3  'c'        3

I know about combine_first, and it works, but I can't use that approach because it is thousands of times slower.

Is there a way to merge that replaces values when columns overlap?

It's somewhat similar to this question, with no answers.

4 Answers

Given your MCVE:

import pandas as pd

base = pd.DataFrame([1,2,3], columns=['id'])
df1 = pd.DataFrame([[1,1]], columns=['id', 'value'])
df2 = pd.DataFrame([[2,2]], columns=['id', 'value'])
df3 = pd.DataFrame([[3,3]], columns=['id', 'value'])

I would suggest concatenating your DataFrames first (using a loop if needed):

df = pd.concat([df1, df2, df3])

And then merge:

pd.merge(base, df, on='id')

It yields:

   id  value
0   1      1
1   2      2
2   3      3
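
If the frames are produced inside a loop, one way to apply this is to collect them in a list and call pd.concat once at the end, instead of merging on every iteration. A minimal sketch, where make_frame is a hypothetical stand-in for whatever builds each DataFrame in your loop:

```python
import pandas as pd

base = pd.DataFrame({"id": [1, 2, 3]})


def make_frame(i):
    # Hypothetical helper: stands in for whatever produces each frame.
    return pd.DataFrame({"id": [i], "value": [i]})


# Collect the pieces first, then concatenate once.
frames = [make_frame(i) for i in (1, 2, 3)]
stacked = pd.concat(frames, ignore_index=True)

# A single merge against the stacked frame yields one 'value' column.
result = pd.merge(base, stacked, on="id")
print(result)
```

Because every frame contributes rows to the same 'value' column, no _x/_y suffixes appear.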

Update

Running the code with the new version of your question and the input provided by @Celius Stingher:

a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df1 = pd.DataFrame(b)
df2 = pd.DataFrame(c)
df3 = pd.DataFrame(d)

We get:

   id constrains  value
0   1          a      1
1   2          b      2
2   3          c      3

This matches your expected output.
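
For completeness, the concat-then-merge step on this updated input could look like the sketch below. It joins on both key columns (named 'constrains' in this sample data). The validate argument is an optional addition that is not part of the original answer; it asks pandas to raise an error if a key pair is duplicated in the stacked frame.

```python
import pandas as pd

base = pd.DataFrame({"id": [1, 2, 3], "constrains": ["a", "b", "c"]})
df1 = pd.DataFrame({"id": [1, 2, 3], "value": [1, 2, 3], "constrains": ["a"] * 3})
df2 = pd.DataFrame({"id": [1, 2, 3], "value": [1, 2, 3], "constrains": ["b"] * 3})
df3 = pd.DataFrame({"id": [1, 2, 3], "value": [1, 2, 3], "constrains": ["c"] * 3})

# Stack all value frames into one long frame.
stacked = pd.concat([df1, df2, df3], ignore_index=True)

# validate="many_to_one" is an optional safeguard (an assumption, not from the
# answer): pandas raises MergeError if an (id, constrains) pair repeats in
# `stacked`, which would otherwise silently duplicate rows of `base`.
result = pd.merge(
    base,
    stacked,
    on=["id", "constrains"],
    how="left",
    validate="many_to_one",
)
print(result)
```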


You can use ffill() for this:

df_1 = pd.DataFrame({'val':[1]}, index=[1])
df_2 = pd.DataFrame({'val':[2]}, index=[2])
df_3 = pd.DataFrame({'val':[3]}, index=[3])

(pd.concat((df_1,df_2,df_3), axis=1)
   .ffill(axis=1)
   .iloc[:,-1]
)

Output:

1    1.0
2    2.0
3    3.0
Name: val, dtype: float64

For your new data:

base.merge(pd.concat((df1,df2,df3)),
           on=['id','constraint'],
           how='left')

Output:

   id constraint  value
0   1        'a'      1
1   2        'b'      2
2   3        'c'      3

Conclusion: concatenate the value frames first, then do a single merge with how='left'.


If you only need to merge all the DataFrames with base (based on the edit to the question):

import pandas as pd
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df_1 = pd.DataFrame(b)
df_2 = pd.DataFrame(c)
df_3 = pd.DataFrame(d)

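dataframes = [df_1,df_2,df_3]
# Left-merge each frame onto base; overlapping 'value' columns get suffixes
for i in dataframes:
    base = base.merge(i,how='left',on=['id','constrains'])
# Collect every value column (value_x, value_y, value)
summation = [col for col in base if col.startswith('value')]
# Each row has a single non-NaN value here, so the row sum recovers it
base['value'] = base[summation].sum(axis=1)
# Drop the leftover suffixed columns, which still contain NaN
base = base.dropna(how='any',axis=1)
print(base)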

Output:

   id constrains  value
0   1          a    1.0
1   2          b    2.0
2   3          c    3.0
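
Note that the row sum only recovers the right value when at most one of the value columns is non-null per row. A possible variant, sketched under the assumption that you want the first non-null value rather than a sum, renames each value column before merging and then coalesces with bfill. The value_0, value_1, ... names are illustrative choices, not something pandas generates.

```python
import pandas as pd

base = pd.DataFrame({"id": [1, 2, 3], "constrains": ["a", "b", "c"]})
frames = [
    pd.DataFrame({"id": [1, 2, 3], "value": [1, 2, 3], "constrains": [c] * 3})
    for c in ("a", "b", "c")
]

merged = base
for i, frame in enumerate(frames):
    # Rename up front so each value column has a predictable, unique name.
    renamed = frame.rename(columns={"value": f"value_{i}"})
    merged = merged.merge(renamed, how="left", on=["id", "constrains"])

value_cols = [col for col in merged.columns if col.startswith("value_")]

# Back-filling along each row pulls the first non-null value into column 0.
merged["value"] = merged[value_cols].bfill(axis=1).iloc[:, 0]
result = merged.drop(columns=value_cols)
print(result)
```

This still performs one merge per frame, so it keeps the loop structure of the answer above while avoiding the reliance on sum.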


Those who simply want to do a merge that overrides the values (which is my case) can use this method, which is very similar to Celius Stingher's answer.

A documented version is in the original gist.

import pandas as pa

def rmerge(left,right,**kwargs):
    # Function to flatten lists from http://rosettacode.org/wiki/Flatten_a_list#Python
    def flatten(lst):
        return sum( ([x] if not isinstance(x, list) else flatten(x) for x in lst), [] )

    # Set default for removing overlapping columns in "left" to be true
    myargs = {'replace':'left'}
    myargs.update(kwargs)

    # Remove the replace key from the argument dict to be sent to
    # pandas merge command
    kwargs = {k:v for k,v in myargs.items() if k != 'replace'}

    if myargs['replace'] is not None:
        # Generate a list of overlapping column names not associated with the join
        skipcols = set(flatten([v for k, v in myargs.items() if k in ['on','left_on','right_on']]))
        leftcols = set(left.columns)
        rightcols = set(right.columns)
        dropcols = list((leftcols & rightcols).difference(skipcols))

        # Remove the overlapping column names from the appropriate DataFrame
        if myargs['replace'].lower() == 'left':
            left = left.copy().drop(dropcols,axis=1)
        elif myargs['replace'].lower() == 'right':
            right = right.copy().drop(dropcols,axis=1)

    df = pa.merge(left,right,**kwargs)

    return df
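
To illustrate the idea behind this approach, here is a condensed sketch. merge_replacing_left is a hypothetical, simplified helper written for this example (it is not the gist's rmerge): it drops the non-key columns of the left frame that also exist in the right frame, then merges.

```python
import pandas as pd


def merge_replacing_left(left, right, on, **kwargs):
    # Hypothetical simplified helper: remove non-key columns from `left`
    # that also exist in `right`, so the merge produces no _x/_y suffixes.
    keys = [on] if isinstance(on, str) else list(on)
    overlap = [c for c in left.columns if c in right.columns and c not in keys]
    return pd.merge(left.drop(columns=overlap), right, on=keys, **kwargs)


left = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
right = pd.DataFrame({"id": [1, 2, 3], "value": [1, 2, 3]})

# The 'value' column of `left` is discarded and replaced by the one in `right`.
result = merge_replacing_left(left, right, on="id", how="left")
print(result)
```

Keep in mind that this replaces the whole column: with how='left', rows of the left frame that have no match in the right frame end up with NaN instead of keeping their previous value.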
