
I would like to merge nine Pandas dataframes together into a single dataframe, doing a join on two columns, controlling the column names. Is this possible?

I have nine datasets. All of them have the following columns:

org, name, items, spend

I want to join them into a single dataframe with the following columns:

org, name, items_df1, spend_df1, items_df2, spend_df2, items_df3...

I've been reading the documentation on merging and joining. I can currently merge two datasets together like this:

ad = pd.DataFrame.merge(df_presents, df_trees,
                        on=['org', 'name'],
                        suffixes=['_presents', '_trees'])

This works great; doing print list(ad.columns.values) shows me the following columns:

[u'org', u'name', u'spend_presents', u'items_presents', u'spend_trees', u'items_trees'...]

But how can I do this for nine dataframes? merge only seems to accept two at a time, and if I do it sequentially, my column names are going to end up very messy.

2 Comments
  • Just found this stackoverflow.com/questions/24853762/… but I'm not sure it works for my example - I guess I need to concatenate, then merge somehow? I want to write my output to a BigQuery table, so I don't know if hierarchical dataframes will work for me. Commented Dec 17, 2015 at 16:00
  • Found this stackoverflow.com/questions/23668427/… which is more promising, but doesn't explain how to control the column names. Commented Dec 17, 2015 at 16:12

3 Answers


You could use functools.reduce to iteratively apply pd.merge to each of the DataFrames:

result = functools.reduce(merge, dfs)

This is equivalent to

result = dfs[0]
for df in dfs[1:]:
    result = merge(result, df)

To pass the on=['org', 'name'] argument, you could use functools.partial to define the merge function:

merge = functools.partial(pd.merge, on=['org', 'name'])

Since specifying the suffixes parameter in functools.partial would only allow one fixed choice of suffix, and since here we need a different suffix for each pd.merge call, I think it would be easiest to prepare the DataFrames' column names before calling pd.merge:

for i, df in enumerate(dfs, start=1):
    df.rename(columns={col:'{}_df{}'.format(col, i) for col in ('items', 'spend')}, 
              inplace=True)

For example,

import pandas as pd
import numpy as np
import functools
np.random.seed(2015)

N = 50
dfs = [pd.DataFrame(np.random.randint(5, size=(N,4)), 
                    columns=['org', 'name', 'items', 'spend']) for i in range(9)]
for i, df in enumerate(dfs, start=1):
    df.rename(columns={col:'{}_df{}'.format(col, i) for col in ('items', 'spend')}, 
              inplace=True)
merge = functools.partial(pd.merge, on=['org', 'name'])
result = functools.reduce(merge, dfs)
print(result.head())

yields

   org  name  items_df1  spend_df1  items_df2  spend_df2  items_df3  \
0    2     4          4          2          3          0          1   
1    2     4          4          2          3          0          1   
2    2     4          4          2          3          0          1   
3    2     4          4          2          3          0          1   
4    2     4          4          2          3          0          1   

   spend_df3  items_df4  spend_df4  items_df5  spend_df5  items_df6  \
0          3          1          0          1          0          4   
1          3          1          0          1          0          4   
2          3          1          0          1          0          4   
3          3          1          0          1          0          4   
4          3          1          0          1          0          4   

   spend_df6  items_df7  spend_df7  items_df8  spend_df8  items_df9  spend_df9  
0          3          4          1          3          0          1          2  
1          3          4          1          3          0          0          3  
2          3          4          1          3          0          0          0  
3          3          3          1          3          0          1          2  
4          3          3          1          3          0          0          3  

2 Comments

This is great! Thank you very much! It is an inner join, however; how would you deal with an outer join?
@jeangelj: You can add how='outer' to the functools.partial call: e.g. merge = functools.partial(pd.merge, on=['org', 'name'], how='outer').

Would doing a big pd.concat() and then renaming all the columns work for you? Something like:

desired_columns = ['items', 'spend']
big_df = pd.concat([df1, df2[desired_columns], ..., dfN[desired_columns]], axis=1)

new_columns = ['org', 'name']
for i in range(num_dataframes):  # 9 in your case
    new_columns.extend(['items_df%i' % i, 'spend_df%i' % i])

big_df.columns = new_columns

This should give you columns like:

org, name, items_df0, spend_df0, items_df1, spend_df1, ..., items_df8, spend_df8
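A minimal runnable sketch of this idea, with made-up data (the keys frame, dfs list, and N below are just illustrative): note that pd.concat(..., axis=1) aligns rows by index, not by the org/name keys, so this only works if all nine frames are already in the same row order.

import pandas as pd
import numpy as np

np.random.seed(2015)
N = 5
# Made-up key frame and nine value frames, all sharing the same row order.
keys = pd.DataFrame({'org': range(N), 'name': range(N)})
dfs = [pd.DataFrame(np.random.randint(5, size=(N, 2)),
                    columns=['items', 'spend']) for _ in range(9)]

desired_columns = ['items', 'spend']
big_df = pd.concat([keys] + [df[desired_columns] for df in dfs], axis=1)

new_columns = ['org', 'name']
for i in range(len(dfs)):
    new_columns.extend(['items_df%i' % i, 'spend_df%i' % i])
big_df.columns = new_columns

print(big_df.head())

Unlike the merge-based answer, this relies on positional alignment rather than an actual join on org and name.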



I've wanted this as well at times but been unable to find a built-in pandas way of doing it. Here is my suggestion (and my plan for the next time I need it):

  1. Create an empty dictionary, merge_dict.
  2. Loop through the index you want for each of your data frames and add the desired values to the dictionary with the index as the key.
  3. Generate a new index as sorted(merge_dict).
  4. Generate a new list of data for each column by looping through merge_dict.items().
  5. Create a new data frame with index=sorted(merge_dict) and columns created in the previous step.

Basically, this is somewhat like a hash join in SQL. It seems like the most efficient approach I can think of, and it shouldn't take too long to code up; a rough sketch follows below.
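For what it's worth, here is a rough sketch of those steps (the function name hash_merge and the org/name/items/spend column names are mine; it assumes each (org, name) pair appears at most once per frame, since a plain dict keeps only one value per key):

import pandas as pd

def hash_merge(dfs, keys=('org', 'name'), value_cols=('items', 'spend')):
    # Steps 1-2: build a dict keyed on the join columns, collecting the
    # suffixed values contributed by each frame.
    merge_dict = {}
    for i, df in enumerate(dfs, start=1):
        for _, row in df.iterrows():
            key = tuple(row[c] for c in keys)
            entry = merge_dict.setdefault(key, {})
            for col in value_cols:
                entry['{}_df{}'.format(col, i)] = row[col]

    # Steps 3-5: the sorted keys become the new index, with one column
    # per collected value; missing entries come out as NaN.
    index = sorted(merge_dict)
    result = pd.DataFrame([merge_dict[k] for k in index],
                          index=pd.MultiIndex.from_tuples(index, names=list(keys)))
    return result.reset_index()

This behaves like an outer join: a key missing from one frame simply leaves NaN in that frame's columns.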

Good luck.

