How to merge many DataFrames by index combining values where columns overlap?

Question

I have many DataFrames that I need to merge.

Let's say:

base: id  constraint
      1   'a'
      2   'b'
      3   'c'

df_1: id value constraint
      1  1     'a'
      2  2     'a'
      3  3     'a'

df_2: id value constraint
      1  1     'b'
      2  2     'b'
      3  3     'b'


df_3: id value constraint
      1  1     'c'
      2  2     'c'
      3  3     'c'

If I try and merge all of them (it'll be in a loop), I get:

a = pd.merge(base, df_1, on=['id', 'constraint'], how='left')
b = pd.merge(a, df_2, on=['id', 'constraint'], how='left')
c = pd.merge(b, df_3, on=['id', 'constraint'], how='left')

id constraint value   value_x  value_y
1  'a'        1       NaN      NaN
2  'b'        NaN     2        NaN
3  'c'        NaN     NaN      3

The desired output would be:

id constraint value
1  'a'        1 
2  'b'        2
3  'c'        3

I know about combine_first, and it works, but I can't use that approach because it is thousands of times slower.

Is there a merge that can replace values in case of columns overlap?

It's somewhat similar to this question, with no answers.

jlandercy · Accepted Answer · 2019-10-07 18:52:36Z

Given your MCVE:

import pandas as pd

base = pd.DataFrame([1,2,3], columns=['id'])
df1 = pd.DataFrame([[1,1]], columns=['id', 'value'])
df2 = pd.DataFrame([[2,2]], columns=['id', 'value'])
df3 = pd.DataFrame([[3,3]], columns=['id', 'value'])

I would suggest concatenating your DataFrames first (using a loop if needed):

df = pd.concat([df1, df2, df3])

And then merge:

pd.merge(base, df, on='id')

It yields:

   id  value
0   1      1
1   2      2
2   3      3

Update

Running the code with the new version of your question and the input provided by @Celius Stingher:

a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df1 = pd.DataFrame(b)
df2 = pd.DataFrame(c)
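df3 = pd.DataFrame(d)

# Same concat-then-merge as above, now joining on both key columns
df = pd.concat([df1, df2, df3])
pd.merge(base, df, on=['id', 'constrains'])

We get: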

   id constrains  value
0   1          a      1
1   2          b      2
2   3          c      3

Which seems to be compliant with your expected output.
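As a minimal sketch of the concat-then-merge idea applied to the question's layout, the frames can be collected in a list (here a comprehension stands in for the loop), stacked once, and merged a single time on both key columns. The names `frames`, `stacked`, and `result` are illustrative assumptions, and the column name `constraint` follows the question rather than the `constrains` spelling used above.

```python
import pandas as pd

base = pd.DataFrame({"id": [1, 2, 3], "constraint": ["a", "b", "c"]})

# Stand-in for the loop in the question: one frame per constraint value.
frames = [
    pd.DataFrame({"id": [1, 2, 3], "value": [1, 2, 3], "constraint": [c] * 3})
    for c in ["a", "b", "c"]
]

# Stack every frame once, then merge a single time on both key columns.
stacked = pd.concat(frames, ignore_index=True)
result = base.merge(stacked, on=["id", "constraint"], how="left")
print(result)
```

Because only one merge happens, no `value_x` / `value_y` suffix columns are created.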

Quang Hoang · Accepted Answer · 2019-10-07 19:17:08Z

You can use ffill() for the purpose:

df_1 = pd.DataFrame({'val':[1]}, index=[1])
df_2 = pd.DataFrame({'val':[2]}, index=[2])
df_3 = pd.DataFrame({'val':[3]}, index=[3])

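# Align the frames side by side on their index, fill each row left to right,
# and keep the last column, which then holds the last non-null value per row.
(pd.concat((df_1,df_2,df_3), axis=1)
   .ffill(axis=1)   # axis must be passed by keyword in recent pandas versions
   .iloc[:,-1]
)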

Output:

1    1.0
2    2.0
3    3.0
Name: val, dtype: float64
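A related way to combine values by index, not taken from the answer above, is to stack the frames row-wise and keep the first non-null value per index label with `groupby(level=0).first()`. The sketch below assumes the frames share a column name and that "first non-null wins" is the desired rule; the sample data is made up for illustration.

```python
import pandas as pd

# Illustrative data: index 2 appears in two frames, once with a missing value.
df_1 = pd.DataFrame({"val": [1.0, None]}, index=[1, 2])
df_2 = pd.DataFrame({"val": [2.0]}, index=[2])
df_3 = pd.DataFrame({"val": [3.0]}, index=[3])

# Stack row-wise, then take the first non-null entry for each index label.
combined = pd.concat((df_1, df_2, df_3)).groupby(level=0).first()
print(combined)
```

Here the missing value for index 2 in `df_1` is filled from `df_2`, since `first()` skips nulls within each group.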

For your new data:

base.merge(pd.concat((df1,df2,df3)),
           on=['id','constraint'],
           how='left')

Output:

   id constraint  value
0   1        'a'      1
1   2        'b'      2
2   3        'c'      3

Conclusion: what you are actually looking for is the how='left' option of merge, applied once to the concatenated frames.
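To illustrate what `how='left'` changes, here is a small sketch in which `base` has one row with no matching value. The extra row (`id` 4, constraint `'d'`) and the names `values`, `left`, and `inner` are assumptions added for the example.

```python
import pandas as pd

base = pd.DataFrame({"id": [1, 2, 3, 4], "constraint": ["a", "b", "c", "d"]})
values = pd.DataFrame(
    {"id": [1, 2, 3], "constraint": ["a", "b", "c"], "value": [1, 2, 3]}
)

# A left merge keeps every row of base; unmatched rows get NaN in "value".
left = base.merge(values, on=["id", "constraint"], how="left")

# An inner merge (the default) drops rows of base that have no match.
inner = base.merge(values, on=["id", "constraint"], how="inner")

print(left)
print(inner)
```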

edited Oct 7, 2019 at 19:17

answered Oct 7, 2019 at 18:33

Quang Hoang

Community · Accepted Answer · 2020-06-20 09:12:55Z

If you only need to merge all the DataFrames with base:

Based on the edit to the question:

import pandas as pd
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df_1 = pd.DataFrame(b)
df_2 = pd.DataFrame(c)
df_3 = pd.DataFrame(d)

dataframes = [df_1,df_2,df_3]
for i in dataframes:
    base = base.merge(i,how='left',on=['id','constrains'])
summation = [col for col in base if col.startswith('value')]
base['value'] = base[summation].sum(axis=1)
base = base.dropna(how='any',axis=1)
print(base)

Output:

   id constrains  value
0   1          a    1.0
1   2          b    2.0
2   3          c    3.0
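The loop above relies on the default `_x` / `_y` suffixes, which can lead to clashing column names once more frames are merged, and `sum` adds values together if more than one frame matches the same row. A hedged variation is to give each frame's value column a unique name before merging and then take the first non-null value per row. The names `value_cols` and `value_{i}` are illustrative assumptions, and a fourth frame is added only to show that the count is not limited to three.

```python
import pandas as pd

base = pd.DataFrame({"id": [1, 2, 3], "constrains": ["a", "b", "c"]})
frames = [
    pd.DataFrame({"id": [1, 2, 3], "value": [1, 2, 3], "constrains": [c] * 3})
    for c in ["a", "b", "c", "d"]
]

merged = base
value_cols = []
for i, frame in enumerate(frames):
    col = f"value_{i}"  # unique per frame, so no suffixes are needed
    value_cols.append(col)
    merged = merged.merge(
        frame.rename(columns={"value": col}),
        how="left",
        on=["id", "constrains"],
    )

# Back-fill across the value columns, then keep the first one:
# it now holds the first non-null value of each row.
merged["value"] = merged[value_cols].bfill(axis=1).iloc[:, 0]
result = merged.drop(columns=value_cols)
print(result)
```

This still performs one merge per frame, so for many frames the single concat-then-merge shown in the other answers is likely to be faster.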

edited Jun 20, 2020 at 9:12

answered Oct 7, 2019 at 18:35

Celius Stingher

Gustavo Lopes · Accepted Answer · 2019-10-08 15:34:22Z

If you simply want a merge that overrides the overlapping values (which was my case), you can use the function below, which is very similar to Celius Stingher's answer.

Documented version is on the original gist.

import pandas as pa

def rmerge(left,right,**kwargs):
    # Function to flatten lists from http://rosettacode.org/wiki/Flatten_a_list#Python
    def flatten(lst):
        return sum( ([x] if not isinstance(x, list) else flatten(x) for x in lst), [] )

    # Set default for removing overlapping columns in "left" to be true
    myargs = {'replace':'left'}
    myargs.update(kwargs)

    # Remove the replace key from the argument dict to be sent to
    # pandas merge command
    kwargs = {k:v for k,v in myargs.items() if k != 'replace'}

    if myargs['replace'] is not None:
        # Generate a list of overlapping column names not associated with the join
        skipcols = set(flatten([v for k, v in myargs.items() if k in ['on','left_on','right_on']]))
        leftcols = set(left.columns)
        rightcols = set(right.columns)
        dropcols = list((leftcols & rightcols).difference(skipcols))

        # Remove the overlapping column names from the appropriate DataFrame
        if myargs['replace'].lower() == 'left':
            left = left.copy().drop(dropcols,axis=1)
        elif myargs['replace'].lower() == 'right':
            right = right.copy().drop(dropcols,axis=1)

    df = pa.merge(left,right,**kwargs)

    return df
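The core idea of `rmerge` can be sketched in a few lines: drop the overlapping non-key columns from the left frame, then merge, so that the right frame's columns win and no suffixes appear. The helper name `merge_replacing_left` and the sample data are hypothetical; this is a simplified illustration, not the gist's API.

```python
import pandas as pd

def merge_replacing_left(left, right, on, how="left"):
    """Hypothetical helper: same-named non-key columns in `right` replace those in `left`."""
    keys = [on] if isinstance(on, str) else list(on)
    # Columns present in both frames that are not join keys.
    overlap = [c for c in left.columns if c in right.columns and c not in keys]
    return left.drop(columns=overlap).merge(right, on=keys, how=how)

left = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
right = pd.DataFrame({"id": [1, 2, 3], "value": [1, 2, 3]})

result = merge_replacing_left(left, right, on="id")
print(result)
```

Note that this replaces the whole column: with a left merge, rows of `left` that have no match in `right` end up with NaN rather than keeping their old value, so it overrides values instead of combining them.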
