
I have two pandas DataFrames df1 and df2 with a fairly standard format:

   one  two  three   feature
A    1    2      3   feature1
B    4    5      6   feature2  
C    7    8      9   feature3   
D    10   11     12  feature4
E    13   14     15  feature5 
F    16   17     18  feature6 
...

And the same format for df2. The sizes of these DataFrames are around 175 MB and 140 MB.

merged_df = pd.merge(df1, df2, on='feature', how='outer', suffixes=('','_features'))

I get the following MemoryError:

File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
    return op.get_result()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 217, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 353, in _get_join_info
    sort=self.sort, how=self.how) 
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 559, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
File "pandas/src/join.pyx", line 187, in pandas.algos.full_outer_join (pandas/algos.c:61680)
  File "pandas/src/join.pyx", line 196, in pandas.algos._get_result_indexer (pandas/algos.c:61978)
MemoryError

Is it possible there is a "size limit" for pandas DataFrames when merging? I am surprised that this doesn't work. Maybe this is a bug in a certain version of pandas?

EDIT: As mentioned in the comments, many duplicates in the merge column can easily cause RAM issues. See: Python Pandas Merge Causing Memory Overflow
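
To make the blow-up concrete, here is a minimal sketch with made-up sizes: a key that appears m times in df1 and n times in df2 produces m * n rows in the result, so even small inputs can explode.

import pandas as pd

# 1,000 rows on each side, all sharing a single key value
left = pd.DataFrame({'feature': ['feature1'] * 1000, 'one': range(1000)})
right = pd.DataFrame({'feature': ['feature1'] * 1000, 'two': range(1000)})

merged = pd.merge(left, right, on='feature', how='outer')
print(len(merged))  # 1,000,000 rows: every left row pairs with every right row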

The question now is: how can we do this merge? It seems the best approach would be to partition the DataFrame somehow.

  • Are there duplicates in your feature column? If there are a lot of duplicates, your join could end up being very large. Commented Oct 3, 2016 at 5:22
  • @maxymoo Yes. Can you explain why this would exceed RAM limits? Let's say df1 has 10 million rows, and feature1 has 500K rows, feature2 has 500K rows, etc. The DataFrame itself is only 150 MB, so why would there be a memory error? Commented Oct 3, 2016 at 5:31
  • Are you using 32 bit python/64 bit? Commented Oct 3, 2016 at 5:33
  • 1
    @SayaliSonawane I understand now. No, there are no NaN values here :) Commented Oct 3, 2016 at 5:50
  • 1
    @SayaliSonawane I think you are right---but I still need a solution. One cannot simply delete these rows. What is the standard way to solve this? Commented Oct 3, 2016 at 6:56

2 Answers


You can try filtering df1 by unique feature values first, merging each piece, and finally concatenating the outputs.

If you need only an outer join, I think there will still be a memory problem. But if you add some code to filter the output of each loop, it can work.

dfs = []
for val in df1.feature.unique():
    # merge only the rows of df1 that share this feature value
    chunk = pd.merge(df1[df1.feature == val], df2, on='feature', how='outer', suffixes=('', '_key'))
    # optionally filter each chunk here to keep memory down, e.g.
    # http://stackoverflow.com/a/39786538/2901002
    # chunk = chunk[(chunk.start <= chunk.start_key) & (chunk.end <= chunk.end_key)]
    dfs.append(chunk)

merged_df = pd.concat(dfs, ignore_index=True)
print (merged_df)

Another solution is to use dask.dataframe.DataFrame.merge.
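
A rough sketch of that approach (the partition count below is arbitrary and needs tuning for your data and RAM):

import dask.dataframe as dd

# wrap the existing pandas DataFrames; npartitions here is a guess
ddf1 = dd.from_pandas(df1, npartitions=20)
ddf2 = dd.from_pandas(df2, npartitions=20)

merged = ddf1.merge(ddf2, on='feature', how='outer', suffixes=('', '_features'))
merged_df = merged.compute()  # materializes the result back into a pandas DataFrame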


1 Comment

I am trying to merge (left join) two large dataframes and a memory error occurs. Could you please let me know which method would be suitable, @jezrael, to merge without a memory error?

Try specifying a data type for the numeric columns to reduce the size of the existing data frames, such as:

import numpy as np

df[['one','two', 'three']] = df[['one','two', 'three']].astype(np.int32)

This should reduce the memory footprint significantly and will hopefully let you perform the merge.
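
As a rough sketch using the column names from the question, you can measure the effect with memory_usage before merging; converting the repeated feature strings to a categorical is a further option beyond the downcast suggested above:

import numpy as np

print(df1.memory_usage(deep=True).sum())  # bytes before shrinking

for df in (df1, df2):
    df[['one', 'two', 'three']] = df[['one', 'two', 'three']].astype(np.int32)
    df['feature'] = df['feature'].astype('category')  # repeated strings compress well as categoricals

print(df1.memory_usage(deep=True).sum())  # bytes after shrinking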

2 Comments

I'm guessing this might not help that much for very large joins, as mentioned above. Is there an efficient way to split up the dataframe by unique feature? If there were (let's say) 10 features in a 100 MB dataframe, you might get significantly smaller DataFrames (if uniform, around 10 MB each).
I'm not sure of an efficient way to split up the dataframe the way you're suggesting, but it would be possible to remove all columns that are shared between the two initial dataframes. The goal is an outer join, so removing columns that aren't unique to the dataframe wouldn't affect the outcome, but would decrease memory. Much like the answer @jezrael has just posted.
