
I am looking for an efficient way to combine 100 pandas DataFrames that represent a grid of information points. The points in each frame are unique and do not overlap those in any other frame, but the frames share columns and rows across a larger patchwork space, i.e.

     1    2    3        4    5    6        7    8    9
A    df1, df1, df1,     df2, df2, df2,     df3, df3, df3
B    df1, df1, df1,     df2, df2, df2,     df3, df3, df3
C    df1, df1, df1,     df2, df2, df2,     df3, df3, df3

D    df4, df4, df4,     df5, df5, df5,     etc, etc, etc
E    df4, df4, df4,     df5, df5, df5,     etc, etc, etc
F    df4, df4, df4,     df5, df5, df5,     etc, etc, etc

Pandas' concat only combines along either the column axis or the row axis, not both at once. So I've been iterating over the data frames and using the df1.combine_first(df2) method (repeated ad infinitum).

Is this the best way to proceed, or is there another more efficient method that I should be aware of?
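For reference, the incremental combine_first approach described above can be written compactly with functools.reduce (a sketch with two illustrative non-overlapping tiles; combine_first aligns on the union of labels, filling the gaps with NaN, and since the tiles never overlap the fold order doesn't matter):

```python
import functools

import numpy as np
import pandas as pd

# Two illustrative non-overlapping tiles of the patchwork.
df1 = pd.DataFrame(np.random.randn(3, 3), index=list('ABC'), columns=list('123'))
df2 = pd.DataFrame(np.random.randn(3, 3), index=list('DEF'), columns=list('456'))

# Fold combine_first over the list; with non-overlapping tiles,
# the order of the frames in the list is irrelevant.
combined = functools.reduce(lambda acc, df: acc.combine_first(df), [df1, df2])
```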

  • I'd split the dfs into 2 lists: the ones that can be combined column-wise and the ones that can be combined row-wise, concat those separately, and then concat the 2 concatenated dfs. Commented Apr 30, 2015 at 17:51

1 Answer


Here's a quick guess at both the convenience and efficiency angles, based on non-overlapping datapoints and assuming very regular data (everything 3x3 in this case).

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 3), index=list('ABC'), columns=list('123'))
df2 = pd.DataFrame(np.random.randn(3, 3), index=list('DEF'), columns=list('123'))
df3 = pd.DataFrame(np.random.randn(3, 3), index=list('ABC'), columns=list('456'))
df4 = pd.DataFrame(np.random.randn(3, 3), index=list('DEF'), columns=list('456'))

The combine_first way has the advantage that you can just dump everything in a list without worrying about the order:

%%timeit
comb_df = pd.DataFrame()
for df in [df1,df2,df3,df4]:  
    comb_df = comb_df.combine_first( df )

100 loops, best of 3: 8.92 ms per loop

The concat way requires you to group things in a specific order, but is more than twice as fast:

%%timeit
df5 = pd.concat( [df1,df2], axis=0 )
df6 = pd.concat( [df3,df4], axis=0 )
df7 = pd.concat( [df5,df6], axis=1 )

100 loops, best of 3: 3.84 ms per loop

Quick check that both ways work the same:

(comb_df == df7).all().all()
True

(Note: all(comb_df == df7) would iterate over the column labels, which are always truthy, so it returns True even when values differ; .all().all() checks every element.)
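For the full 100-frame patchwork, the two-stage concat can be generalized without hand-ordering every frame (a sketch under the assumption that frames sharing the same row labels form one horizontal band of the grid; tiles is a hypothetical list standing in for the 100 frames):

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Hypothetical flat list of tiles; in practice this would hold the 100 frames.
tiles = [
    pd.DataFrame(np.random.randn(3, 3), index=list('ABC'), columns=list('123')),
    pd.DataFrame(np.random.randn(3, 3), index=list('ABC'), columns=list('456')),
    pd.DataFrame(np.random.randn(3, 3), index=list('DEF'), columns=list('123')),
    pd.DataFrame(np.random.randn(3, 3), index=list('DEF'), columns=list('456')),
]

# Group tiles that share the same row labels into horizontal bands,
# so the input list can be in any order.
bands = defaultdict(list)
for df in tiles:
    bands[tuple(df.index)].append(df)

# Concat each band column-wise (ordered by first column label),
# then stack the bands row-wise (ordered by first row label).
rows = [pd.concat(sorted(group, key=lambda d: d.columns[0]), axis=1)
        for group in bands.values()]
grid = pd.concat(sorted(rows, key=lambda d: d.index[0]), axis=0)
```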

1 Comment

Thanks, nice answer. I'll go with combine_first because, due to some idiosyncrasies, there is no easy and predictable way to join the data frames first in one direction and then the other. Fortunately, the combine_first method isn't as slow as I had originally anticipated.
