
I have two pandas DataFrames df1 and df2 with a fairly standard format:

   one  two  three   feature
A    1    2      3   feature1
B    4    5      6   feature2  
C    7    8      9   feature3   
D    10   11     12  feature4
E    13   14     15  feature5 
F    16   17     18  feature6 
...

And the same format for df2. The sizes of these DataFrames are around 175 MB and 140 MB.

merged_df = pd.merge(df1, df2, on='feature', how='outer', suffixes=('','_features'))

I get the following MemoryError:

File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
    return op.get_result()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 217, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 353, in _get_join_info
    sort=self.sort, how=self.how) 
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 559, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
File "pandas/src/join.pyx", line 187, in pandas.algos.full_outer_join (pandas/algos.c:61680)
  File "pandas/src/join.pyx", line 196, in pandas.algos._get_result_indexer (pandas/algos.c:61978)
MemoryError

Is it possible there is a "size limit" for pandas DataFrames when merging? I am surprised that this doesn't work. Maybe this is a bug in a certain version of pandas?

EDIT: As mentioned in the comments, many duplicates in the merge column can easily cause RAM issues. See: Python Pandas Merge Causing Memory Overflow
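
To make the blow-up concrete, here is a minimal sketch with made-up sizes: a key that appears m times in df1 and n times in df2 produces m * n rows in the result, so even small inputs can explode.

import pandas as pd

# 1,000 rows on each side, all sharing a single key value
left = pd.DataFrame({'feature': ['feature1'] * 1000, 'one': range(1000)})
right = pd.DataFrame({'feature': ['feature1'] * 1000, 'two': range(1000)})

merged = pd.merge(left, right, on='feature', how='outer')
print(len(merged))  # 1,000,000 rows: every left row pairs with every right row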

The question now is: how can we do this merge? It seems the best approach would be to partition the DataFrame somehow.

  • Are there duplicates in your feature column? If there are a lot of duplicates, your join could end up being very large. Commented Oct 3, 2016 at 5:22
  • @maxymoo Yes. Can you explain why this would exceed RAM limits? Let's say df1 has 10 million rows, and feature1 has 500K rows, feature2 has 500K rows, etc. The DataFrame itself is only 150 MB, so why would there be a memory error? Commented Oct 3, 2016 at 5:31
  • Are you using 32 bit python/64 bit? Commented Oct 3, 2016 at 5:33
  • 1
    @SayaliSonawane I understand now. No, there are no NaN values here :) Commented Oct 3, 2016 at 5:50
  • 1
    @SayaliSonawane I think you are right---but I still need a solution. One cannot simply delete these rows. What is the standard way to solve this? Commented Oct 3, 2016 at 6:56

2 Answers


You can try filtering df1 by unique feature values first, merging each piece, and finally concatenating the outputs.

If you need only an outer join, I think there will still be a memory problem. But if you add some code to filter the output of each loop, it can work.

dfs = []
for val in df1.feature.unique():
    # merge only the rows of df1 that share this feature value
    chunk = pd.merge(df1[df1.feature == val], df2, on='feature', how='outer', suffixes=('', '_key'))
    # optionally filter each chunk here to keep memory down, e.g.
    # http://stackoverflow.com/a/39786538/2901002
    # chunk = chunk[(chunk.start <= chunk.start_key) & (chunk.end <= chunk.end_key)]
    dfs.append(chunk)

merged_df = pd.concat(dfs, ignore_index=True)
print (merged_df)

Another solution is to use dask.dataframe.DataFrame.merge.
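
A rough sketch of that approach (the partition count below is arbitrary and needs tuning for your data and RAM):

import dask.dataframe as dd

# wrap the existing pandas DataFrames; npartitions here is a guess
ddf1 = dd.from_pandas(df1, npartitions=20)
ddf2 = dd.from_pandas(df2, npartitions=20)

merged = ddf1.merge(ddf2, on='feature', how='outer', suffixes=('', '_features'))
merged_df = merged.compute()  # materializes the result back into a pandas DataFrame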


1 Comment

I am trying to merge (left join) two large dataframes and a memory error occurs. Could you please let me know which method would be suitable, @jezrael, to merge without a memory error?

Try specifying a data type for the numeric columns to reduce the size of the existing data frames, such as:

import numpy as np

df[['one','two', 'three']] = df[['one','two', 'three']].astype(np.int32)

This should reduce the memory footprint significantly and will hopefully let you perform the merge.
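
As a rough sketch using the column names from the question, you can measure the effect with memory_usage before merging; converting the repeated feature strings to a categorical is a further option beyond the downcast suggested above:

import numpy as np

print(df1.memory_usage(deep=True).sum())  # bytes before shrinking

for df in (df1, df2):
    df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(np.int32)
    df['feature'] = df['feature'].astype('category')  # repeated strings compress well as categoricals

print(df1.memory_usage(deep=True).sum())  # bytes after shrinking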

2 Comments

I'm guessing this might not help that much for very large joins, as mentioned above. Is there an efficient way to split up the dataframe by unique feature? If there were (let's say) 10 features in a 100 MB dataframe, you might get significantly smaller DataFrames (if uniform, around 10 MB each).
I'm not sure of an efficient way to split up the dataframe the way you're suggesting, but it would be possible to remove all columns that are shared between the two initial dataframes. The goal is an outer join, so removing columns that aren't unique to the dataframe wouldn't affect the outcome, but would decrease memory. Much like the answer @jezrael has just posted.
