I have two pandas DataFrames df1 and df2 with a fairly standard format:
   one  two  three   feature
A    1    2      3  feature1
B    4    5      6  feature2
C    7    8      9  feature3
D   10   11     12  feature4
E   13   14     15  feature5
F   16   17     18  feature6
...
df2 has the same format. The two DataFrames are around 175 MB and 140 MB, respectively. When I run
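(For reference, the in-memory size of a DataFrame can be checked like this; `deep=True` matters because object/string columns are otherwise undercounted. The toy frame here is just a stand-in for df1/df2:)

```python
import pandas as pd

# Small stand-in frame; real df1/df2 are much larger.
df = pd.DataFrame({"feature": ["feature1", "feature2"] * 1000,
                   "one": range(2000)})

# deep=True counts the actual string payloads in object columns,
# which the shallow accounting misses.
mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print(f"{mb:.2f} MB")
```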
merged_df = pd.merge(df1, df2, on='feature', how='outer', suffixes=('','_features'))
I get the following MemoryError:
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
return op.get_result()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 217, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 353, in _get_join_info
sort=self.sort, how=self.how)
File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 559, in _get_join_indexers
return join_func(lkey, rkey, count, **kwargs)
File "pandas/src/join.pyx", line 187, in pandas.algos.full_outer_join (pandas/algos.c:61680)
File "pandas/src/join.pyx", line 196, in pandas.algos._get_result_indexer (pandas/algos.c:61978)
MemoryError
Is there a "size limit" for pandas DataFrames when merging? I am surprised this doesn't work. Could this be a bug in a certain version of pandas?
EDIT: As mentioned in the comments, many duplicates in the merge column can easily cause RAM issues. See: Python Pandas Merge Causing Memory Overflow
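To check whether duplicate keys are the problem before attempting the merge, you can estimate the size of the outer join: each key contributes (its count in df1) × (its count in df2) rows, plus the rows for keys present on only one side. A quick sketch (the toy df1/df2 here stand in for the real frames):

```python
import pandas as pd

# Toy frames with duplicated keys, standing in for the real df1/df2.
df1 = pd.DataFrame({"feature": ["feature1"] * 3 + ["feature2"] * 2,
                    "one": range(5)})
df2 = pd.DataFrame({"feature": ["feature1"] * 4 + ["feature2"] * 1,
                    "two": range(5)})

left_counts = df1["feature"].value_counts()
right_counts = df2["feature"].value_counts()

# Keys on both sides multiply: left count * right count per key.
matched = (left_counts * right_counts).dropna()
# Keys on only one side each contribute their own count to an outer join.
only_left = left_counts[~left_counts.index.isin(right_counts.index)].sum()
only_right = right_counts[~right_counts.index.isin(left_counts.index)].sum()

estimated_rows = int(matched.sum() + only_left + only_right)
print(estimated_rows)  # 3*4 + 2*1 = 14
```

If this estimate is orders of magnitude larger than len(df1) + len(df2), the MemoryError is coming from the join blow-up, not from the raw size of the inputs.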
The question now is: how can we do this merge? It seems the best way would be to partition the DataFrames somehow.
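One way to sketch that partitioning (an assumption on my part, not a pandas built-in): split the set of key values into chunks, merge each chunk separately, and combine the pieces. Only one chunk's join result is in memory at a time:

```python
import numpy as np
import pandas as pd

def merge_in_chunks(df1, df2, key="feature", chunks=10):
    """Outer-merge df1 and df2 piecewise by partitioning the key values,
    so only one partition's join result is materialized at a time."""
    # All distinct key values from both sides.
    keys = pd.unique(pd.concat([df1[key], df2[key]], ignore_index=True))
    pieces = []
    for part in np.array_split(keys, chunks):
        left = df1[df1[key].isin(part)]
        right = df2[df2[key].isin(part)]
        pieces.append(pd.merge(left, right, on=key, how="outer",
                               suffixes=("", "_features")))
    return pd.concat(pieces, ignore_index=True)
```

Note that concatenating the pieces at the end still needs memory for the full result; the real win comes from writing each piece to disk (e.g. appending to a CSV or HDF5 store) instead of collecting them in a list.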
Comments:
— Are there duplicates in the feature column? If there are a lot of duplicates, your join could end up being very large.
— df1 has 10 million rows, and feature1 has 500K rows, feature2 has 500K rows, etc. The DataFrame itself is only 150 MB; why would there be a memory error?