
I am looking for an efficient way to combine 100 pandas DataFrames that represent a grid of information points. The points in each frame are unique and do not overlap those in any other frame, but the frames share columns and rows across a larger patchwork space, i.e.

     1    2    3        4    5    6        7    8    9
A    df1, df1, df1,     df2, df2, df2,     df3, df3, df3
B    df1, df1, df1,     df2, df2, df2,     df3, df3, df3
C    df1, df1, df1,     df2, df2, df2,     df3, df3, df3

D    df4, df4, df4,     df5, df5, df5,     etc, etc, etc
E    df4, df4, df4,     df5, df5, df5,     etc, etc, etc
F    df4, df4, df4,     df5, df5, df5,     etc, etc, etc

Pandas' concat only combines along either the column axis or the row axis, not both at once. So I've been iterating over the data frames and using the df1.combine_first(df2) method (repeated ad infinitum).

Is this the best way to proceed, or is there another more efficient method that I should be aware of?
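For reference, the incremental combine_first approach described above can be written compactly with functools.reduce (a sketch with two illustrative non-overlapping tiles; combine_first aligns on the union of labels, filling the gaps with NaN, and since the tiles never overlap the fold order doesn't matter):

```python
import functools

import numpy as np
import pandas as pd

# Two illustrative non-overlapping tiles of the patchwork.
df1 = pd.DataFrame(np.random.randn(3, 3), index=list('ABC'), columns=list('123'))
df2 = pd.DataFrame(np.random.randn(3, 3), index=list('DEF'), columns=list('456'))

# Fold combine_first over the list; with non-overlapping tiles,
# the order of the frames in the list is irrelevant.
combined = functools.reduce(lambda acc, df: acc.combine_first(df), [df1, df2])
```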

  • I'd split the dfs into 2 lists: the ones that can be combined column-wise and the ones that can be combined row-wise, concat those separately, and then concat the 2 concatenated dfs. Commented Apr 30, 2015 at 17:51

1 Answer


Here's a quick guess at both the convenience and efficiency angles, based on non-overlapping datapoints and assuming very regular data (everything 3x3 in this case).

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 3), index=list('ABC'), columns=list('123'))
df2 = pd.DataFrame(np.random.randn(3, 3), index=list('DEF'), columns=list('123'))
df3 = pd.DataFrame(np.random.randn(3, 3), index=list('ABC'), columns=list('456'))
df4 = pd.DataFrame(np.random.randn(3, 3), index=list('DEF'), columns=list('456'))

The combine_first way has the advantage that you can just dump everything in a list without worrying about the order:

%%timeit
comb_df = pd.DataFrame()
for df in [df1,df2,df3,df4]:  
    comb_df = comb_df.combine_first( df )

100 loops, best of 3: 8.92 ms per loop

The concat way requires you to group things in a specific order, but is more than twice as fast:

%%timeit
df5 = pd.concat( [df1,df2], axis=0 )
df6 = pd.concat( [df3,df4], axis=0 )
df7 = pd.concat( [df5,df6], axis=1 )

100 loops, best of 3: 3.84 ms per loop

Quick check that both ways work the same:

(comb_df == df7).all().all()
True

(Note: all(comb_df == df7) would iterate over the column labels, which are always truthy, so it returns True even when values differ; .all().all() checks every element.)
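For the full 100-frame patchwork, the two-stage concat can be generalized without hand-ordering every frame (a sketch under the assumption that frames sharing the same row labels form one horizontal band of the grid; tiles is a hypothetical list standing in for the 100 frames):

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Hypothetical flat list of tiles; in practice this would hold the 100 frames.
tiles = [
    pd.DataFrame(np.random.randn(3, 3), index=list('ABC'), columns=list('123')),
    pd.DataFrame(np.random.randn(3, 3), index=list('ABC'), columns=list('456')),
    pd.DataFrame(np.random.randn(3, 3), index=list('DEF'), columns=list('123')),
    pd.DataFrame(np.random.randn(3, 3), index=list('DEF'), columns=list('456')),
]

# Group tiles that share the same row labels into horizontal bands,
# so the input list can be in any order.
bands = defaultdict(list)
for df in tiles:
    bands[tuple(df.index)].append(df)

# Concat each band column-wise (ordered by first column label),
# then stack the bands row-wise (ordered by first row label).
rows = [pd.concat(sorted(group, key=lambda d: d.columns[0]), axis=1)
        for group in bands.values()]
grid = pd.concat(sorted(rows, key=lambda d: d.index[0]), axis=0)
```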

1 Comment

Thanks, nice answer. I'll go with combine_first because, due to some idiosyncrasies, there is no easy and predictable way to join the data frames first in one direction and then the other. Fortunately, the combine_first method isn't as slow as I had originally anticipated.
