
I'm new to Pandas and am trying to merge a few subsets of data. I'm giving a specific case where this happens, but the question is general: How/why is it happening and how can I work around it?

The data I load is only around 85 MB, but I often watch my Python session climb to nearly 10 GB of memory usage and then raise a memory error.

I have no idea why this happens, but it's killing me as I can't even get started looking at the data the way I want to.

Here's what I've done:

Importing the Main Data

import requests, zipfile, StringIO
import numpy as np
import pandas as pd 


STAR2013url="http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013_all_csv_v3.zip"
STAR2013fileName = 'ca2013_all_csv_v3.txt'

r = requests.get(STAR2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))

STAR2013=pd.read_csv(z.open(STAR2013fileName))

Importing Some Cross-Referencing Tables

STARentityList2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013entities_csv.zip"
STARentityList2013fileName = "ca2013entities_csv.txt"
r = requests.get(STARentityList2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARentityList2013=pd.read_csv(z.open(STARentityList2013fileName))

STARlookUpTestID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/tests.zip"
STARlookUpTestID2013fileName = "Tests.txt"
r = requests.get(STARlookUpTestID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpTestID2013=pd.read_csv(z.open(STARlookUpTestID2013fileName))

STARlookUpSubgroupID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/subgroups.zip"
STARlookUpSubgroupID2013fileName = "Subgroups.txt"
r = requests.get(STARlookUpSubgroupID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpSubgroupID2013=pd.read_csv(z.open(STARlookUpSubgroupID2013fileName))

Renaming a Column ID to Allow for Merge

STARlookUpSubgroupID2013 = STARlookUpSubgroupID2013.rename(columns={'001':'Subgroup ID'})
STARlookUpSubgroupID2013

Successful Merge

merged = pd.merge(STAR2013,STARlookUpSubgroupID2013, on='Subgroup ID')

Try a second merge. This is where the memory error happens:

merged=pd.merge(merged, STARentityList2013, on='School Code')

I did all of this in ipython notebook, but don't think that changes anything.
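As an aside, wrapping each download in a function keeps the intermediate request and zip objects from lingering in the session: once the function returns, they go out of scope and can be garbage-collected. A sketch of such a helper (hypothetical; demonstrated here with an in-memory zip standing in for the real download, and using Python 3's io.BytesIO in place of StringIO):

```python
import io
import zipfile

import pandas as pd

def read_zipped_csv(zip_bytes, member):
    # Hypothetical helper: once this returns, the intermediate ZipFile and
    # buffer go out of scope and can be garbage-collected, so only the
    # resulting DataFrame stays in the session's memory.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        with z.open(member) as f:
            return pd.read_csv(f)

# Demo with an in-memory zip standing in for requests.get(url).content:
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('data.txt', 'a,b\n1,2\n3,4\n')

df = read_zipped_csv(buf.getvalue(), 'data.txt')
print(df.shape)  # (2, 2)
```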

  • Check the shape of each data frame before and after the merge. Commented Sep 23, 2015 at 23:26
  • I encountered this behavior when duplicates are present in the on key column. When I removed the duplicates everything worked fine. You can look at the DataFrame shape to check if you are getting any unexpected results. Commented Sep 23, 2015 at 23:26
  • One thing that may help is to put the creation of the DataFrames in functions, that way the strings and zipfiles can be closed and garbage collected (rather than sitting in memory). One question is how big do you expect the final merge to be? If you have a n row merging with an m row it can be n*m rows, which could be VERY big here. Commented Sep 24, 2015 at 0:20
  • Also the DataFrame is 200 million rows!!!! My machine suggests I'm already around 2.5Gb just reading that. Commented Sep 24, 2015 at 0:59
  • @Alexander: using .size()? @yakym: can you give an example of that? @Andy: 200 million rows is big, but isn't that the point of pandas? Better handling of data and 'big data.' I like the merging-within-a-function idea. Gotta collect more memory tricks like this. Commented Sep 24, 2015 at 15:30
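The duplicate-key effect described in the comments is easy to see on a toy example: a key appearing n times on the left and m times on the right produces n * m rows in the merge, which is how an 85 MB input can balloon into hundreds of millions of rows. A minimal sketch with hypothetical frames:

```python
import pandas as pd

# Hypothetical miniature frames; 'School Code' has duplicates on both sides.
left = pd.DataFrame({'School Code': [1, 1, 2], 'x': ['a', 'b', 'c']})
right = pd.DataFrame({'School Code': [1, 1, 1], 'y': ['p', 'q', 'r']})

merged = pd.merge(left, right, on='School Code')

# Key 1 appears 2 times on the left and 3 times on the right, so the inner
# merge emits 2 * 3 = 6 rows for that key; key 2 has no match and emits none.
print(left.shape, right.shape, merged.shape)  # (3, 2) (3, 2) (6, 3)
```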

1 Answer


Although this is an old question, I recently came across the same problem.

In my instance, duplicate keys are required in both dataframes, and I needed a method that could tell, ahead of computation, whether a merge would fit into memory, and if not, change the computation method.

The method I came up with is as follows:

Calculate merge size:

def merge_size(left_frame, right_frame, group_by, how='inner'):
    # Row counts per key on each side of the merge.
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection

    # NaN != NaN, so self-inequality counts NaN keys (groupby drops them).
    left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
    right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    # Each shared key contributes (left count) * (right count) rows.
    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_nan * right_nan]

    # Keys present on only one side contribute rows only to non-inner joins.
    left_size = [left_groups[group_name] for group_name in left_diff]
    right_size = [right_groups[group_name] for group_name in right_diff]
    if how == 'inner':
        return sum(sizes)
    elif how == 'left':
        return sum(sizes + left_size)
    elif how == 'right':
        return sum(sizes + right_size)
    return sum(sizes + left_size + right_size)
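A brief illustration of the self-inequality trick used above for counting NaN keys, since it looks odd at first glance (the frame below is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'key': [1.0, float('nan'), float('nan')], 'v': [1, 2, 3]})

# NaN != NaN evaluates True, so comparing a column with itself flags NaN rows.
nan_rows = len(df[df['key'] != df['key']])
print(nan_rows)  # 2

# groupby silently drops NaN keys, which is why they need separate handling.
grouped_total = df.groupby('key').size().sum()
print(grouped_total)  # 1
```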

Note:

At present, the key can only be a single label, not a list. Passing a list for group_by returns the sum of the merge sizes computed for each label individually, which will be far larger than the actual merge size.

If you are using a list of labels for the group_by, an upper bound on the final row count is:

min([merge_size(df1, df2, label, how) for label in group_by])

Check if this fits in memory

The merge_size function defined here returns the number of rows which will be created by merging two dataframes together.

By multiplying this by the count of columns from both dataframes and then by the item size of the dtype (e.g. np.float32 or np.float64), you can get a rough idea of how large the resulting dataframe will be in memory. This can then be compared against psutil.virtual_memory().available to see whether your system can hold the full merge.

import numpy as np
import psutil

def mem_fit(df1, df2, key, how='inner'):
    rows = merge_size(df1, df2, key, how)
    # The merged frame keeps all columns from both sides; the key is shared.
    cols = len(df1.columns) + (len(df2.columns) - 1)
    required_memory = (rows * cols) * np.dtype(np.float64).itemsize

    return required_memory <= psutil.virtual_memory().available

The merge_size method has been proposed as an extension of pandas in this issue: https://github.com/pandas-dev/pandas/issues/15068.
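When mem_fit returns False, one possible fallback is to merge in chunks so that only a slice of the output is materialized at a time. A hedged sketch (the merge_in_chunks helper and the demo frames are hypothetical, not part of the answer above):

```python
import pandas as pd

def merge_in_chunks(left, right, key, chunk_rows=100_000):
    # Hypothetical fallback: merge the left frame piece by piece so only one
    # chunk's merge output is materialized at a time (each piece could also
    # be written to disk instead of accumulated in memory).
    pieces = []
    for start in range(0, len(left), chunk_rows):
        chunk = left.iloc[start:start + chunk_rows]
        pieces.append(pd.merge(chunk, right, on=key))
    return pd.concat(pieces, ignore_index=True)

# Tiny demonstration: the chunked result matches a one-shot inner merge.
left = pd.DataFrame({'k': [1, 1, 2, 2], 'a': range(4)})
right = pd.DataFrame({'k': [1, 2], 'b': ['x', 'y']})
out = merge_in_chunks(left, right, 'k', chunk_rows=2)
print(len(out) == len(pd.merge(left, right, on='k')))  # True
```

Accumulating the pieces still costs memory overall; the real win comes from spilling each piece to disk, but the chunking loop is the same either way.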
