
General Issue

I have an arbitrary list of pandas.DataFrames (I'll use two to keep the example clear), and I want to concat them on an Index that:

  1. is neither the inner nor the outer join of the existing DataFrames' indexes
  2. is a different, separate Index whose dates all fall within every DataFrame's index

For example, take the following two DataFrames (note the difference in index shapes):

In [01]: d1 = pandas.DataFrame( numpy.random.randn(15, 4), 
                                columns = ['a', 'b', 'c', 'd'], 
                                index = pandas.date_range(start = '01/01/2001', 
                                                          freq = 'B', 
                                                          periods = 15)
          )

In [02]: d2 = pandas.DataFrame( numpy.random.randn(17, 4), 
                                columns = ['e', 'f', 'g', 'h'], 
                                index = pandas.date_range(start = '01/05/2001', 
                                                          freq = 'B', 
                                                          periods = 17)
          )

I would like to join these two DataFrames on an intersecting Index, such as my_index, constructed here:

In [03]: ind = range(0, 10, 2)
In [04]: my_index = d2.index[ind].copy()

So the desired output should match the following:

In [05]: d1.loc[my_index, :].join(d2.loc[my_index, :])
Out[05]: 
               a         b         c         d         e         f  \
2001-01-05  1.702556 -0.885554  0.766257 -0.731700 -1.071232  1.806680   
2001-01-09 -0.968689 -0.700311  1.024988 -0.705764  0.804285 -0.337177   
2001-01-11  1.249893 -0.613356  1.975736 -0.093838  0.428004  0.634204   
2001-01-15  0.430000  0.502100  0.194092  0.588685 -0.507332  1.404635   
2001-01-17  1.005721  0.604771 -2.296667  0.157201  1.583537  1.359332   

               g         h  
2001-01-05 -1.183528  1.260880  
2001-01-09  0.352487  0.700853  
2001-01-11  1.060694  0.040667  
2001-01-15 -0.044510  0.565152  
2001-01-17 -0.731624 -0.331027  

Personal Considerations

Because this is for a larger application in which I will have an arbitrary number of DataFrames, I'd like to:

  1. Use existing pandas functionality instead of building my own hack, e.g. reduce( map( ) ) etc.
  2. Return views of the intersection of the DataFrames instead of creating copies of them
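For reference, the kind of reduce(map()) hack I'd like to avoid looks roughly like this. This is a sketch assuming modern pandas, where pandas.date_range replaces the old DatetimeIndex(start=...) constructor used above:

```python
# Sketch of the reduce(map(...)) approach, assuming modern pandas
# (pandas.date_range instead of the removed DatetimeIndex constructor).
import functools

import numpy
import pandas

d1 = pandas.DataFrame(numpy.random.randn(15, 4),
                      columns=['a', 'b', 'c', 'd'],
                      index=pandas.date_range(start='01/01/2001',
                                              freq='B', periods=15))
d2 = pandas.DataFrame(numpy.random.randn(17, 4),
                      columns=['e', 'f', 'g', 'h'],
                      index=pandas.date_range(start='01/05/2001',
                                              freq='B', periods=17))
my_index = d2.index[range(0, 10, 2)].copy()

# map: restrict every frame to my_index; reduce: join the results pairwise.
result = functools.reduce(lambda left, right: left.join(right),
                          map(lambda df: df.reindex(my_index), (d1, d2)))
```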

2 Answers

I don't think there is an out-of-the-box Pandas function for doing this. However, it's not hard to build your own:

def select_join(dfs, index):
    # Restrict the first frame to the target index, then inner-join
    # the remaining frames onto it.
    result = dfs[0].reindex(index)
    for df in dfs[1:]:
        result = result.join(df, how='inner')
    return result

For example,

import numpy as np
import pandas as pd
import string
import itertools as IT

columns = iter(string.ascii_lowercase)
dfs = []
for i in range(3):
    d1 = pd.DataFrame(np.random.randn(15, 4),
                      columns=list(IT.islice(columns, 4)),
                      index=pd.date_range(start='01/01/2001',
                                          freq='B',
                                          periods=15))
    dfs.append(d1)

ind = range(0, 10, 2)
my_index = d1.index[ind].copy()
print(select_join(dfs, my_index))

yields

                   a         b         c         d         e         f  \
2001-01-01  0.228430 -1.154375 -0.612703 -2.760826 -0.877355 -0.071581   
2001-01-03  1.452750  1.341027  0.051486  1.231563  0.428353  1.320172   
2001-01-05 -0.966979 -1.997200 -0.376060 -0.692346 -1.689897  0.549653   
2001-01-09 -0.117443 -0.888103  2.092829 -0.467220 -1.083004 -1.443015   
2001-01-11 -0.168980 -0.152663  0.365618  0.444175 -1.472091 -0.578182   

                   g         h         i         j         k         l  
2001-01-01 -0.098758  0.920457 -1.072377 -0.627720  0.223060  0.903130  
2001-01-03  1.962124  1.134501 -0.209813 -2.309090  0.358121  0.655156  
2001-01-05  1.088195 -1.705393 -0.161167 -0.339617  0.945495  0.220701  
2001-01-09  0.970829  1.931192  0.943150 -1.895580  0.815188 -1.485206  
2001-01-11  0.747193 -1.221069 -0.164531 -0.395197 -0.754051  0.922090  

Regarding the second consideration: It is impossible to return a view if index is arbitrary. The DataFrame stores data (of like dtype) in a NumPy array. When you select arbitrary rows from a NumPy array, space for a new array is allocated and the rows are copied from the original array into the new array. Only when the selection can be expressed as a basic slice is a view returned. This limitation of NumPy -- a very hard limitation to remove! -- bubbles up into Pandas, causing DataFrames to return copies when the index is not expressible as a basic slice.
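The distinction can be seen directly in NumPy (a minimal illustration, not part of the original answer):

```python
import numpy as np

a = np.arange(20).reshape(5, 4)

basic = a[1:4]         # basic slice -> view of a's memory
fancy = a[[0, 2, 4]]   # arbitrary row selection -> freshly allocated copy

print(np.shares_memory(a, basic))   # True: the slice is a view
print(np.shares_memory(a, fancy))   # False: the selection was copied

a[1, 0] = 999
print(basic[0, 0])                  # 999: the view sees the mutation
```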


4 Comments

I'm really surprised there isn't out-of-the-box functionality for this (because it seems like a trivial use case). Before I RTD I thought concat( [df_1, df_2], join_axes = [my_axis] ) was the specific functionality I was looking for, however, you most certainly would know! Thanks for the response @unutbu!
For completeness, the fastest implementation I could come up with was: def join_on_index(df_list, index): return pandas.concat( map( lambda x: x.reindex(index), df_list), axis = 1)
Interesting! Feel free to post that as an answer (and accept it if you find that's the best solution.) The reason why I chose to avoid concat here is because it can raise an error if the index contains duplicates while join does not.
Then maybe I'll post it as an answer for completeness, but I will certainly only accept yours as the correct answer :-)
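The duplicate-index difference mentioned in the comments can be demonstrated with a small, hypothetical example; note that the exact exception concat raises varies across pandas versions:

```python
# join tolerates duplicate index labels; concat along axis=1 must align
# the two indexes and fails when labels are duplicated.
import pandas

df1 = pandas.DataFrame({'a': [1, 2, 3]}, index=[0, 0, 1])  # duplicate label 0
df2 = pandas.DataFrame({'b': [4, 5]}, index=[0, 1])

joined = df1.join(df2)   # works: both rows labelled 0 pick up b = 4

try:
    pandas.concat([df1, df2], axis=1)
    concat_raised = False
except Exception:        # InvalidIndexError in recent pandas versions
    concat_raised = True
```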

Different Methods & Their Times (for Completeness)

I've accepted @unutbu's answer, but I thought it might be valuable to show the two functions I created (and @unutbu's) and their different %timeit values in case anyone wants to use them:

Create dfs and my_index:

import itertools
import string

import numpy
import pandas

columns = iter(string.ascii_lowercase)
dfs = []
for i in range(5):
    tmp = pandas.DataFrame(numpy.random.randn(1000, 4),
                           columns=list(itertools.islice(columns, 4)),
                           index=pandas.date_range(start='01/01/2000',
                                                   freq='B',
                                                   periods=1000))
    dfs.append(tmp)

ind = range(0, 1000, 2)
my_index = tmp.index[ind].copy()

3 Different Implementations

def join_on_index_a(df_list, index):
    # Reindex each frame onto the target index, then concat along columns.
    return pandas.concat(
        map(lambda x: x.reindex(index), df_list),
        axis=1
    )

# @unutbu's implementation
def join_on_index_b(df_list, index):
    result = df_list[0].reindex(index)
    for df in df_list[1:]:
        result = result.join(df, how='inner')
    return result

def join_on_index_c(df_list, index):
    # Like (a), but .loc-based selection requires every label to be present.
    return pandas.concat(map(lambda x: x.loc[index, :], df_list), axis=1)

The Results Using IPython %timeit

In [49]: %timeit join_on_index_a(dfs, my_index)
1000 loops, best of 3: 1.85 ms per loop

In [50]: %timeit join_on_index_b(dfs, my_index)
100 loops, best of 3: 1.94 ms per loop

In [51]: %timeit join_on_index_c(dfs, my_index)
100 loops, best of 3: 21.5 ms per loop
