How do I release memory used by a pandas dataframe but not slices?

Question

As noted in this question, it is possible to explicitly release the memory of a dataframe. I am running into an issue that is something of an extension of that problem. I often import a whole data set and then make a selection from it. The selections tend to come in two forms:

df_row_slice = df.sample(frac=0.6)
df_column_slice = df[columns]

Past some point in my code I know that I will no longer make any reference to the original df. Is there a way to release all the memory which is not referenced by the slices? I realize I could .copy() when I slice but this temporary duplication would cause me to exceed my memory.

UPDATE

Following the reply I think the method would be to drop the columns or rows from the original frame.

df_column_slice = df[columns]
cols_to_drop = [i for i in df.columns if i not in columns]
df = df.drop(columns=cols_to_drop)

or

df_row_slice = df.sample(frac=0.6)
df = df.drop(df_row_slice.index)

Hopefully the garbage collection then works properly to free up the memory. Would it be smart to call

import gc
gc.collect()

just to be safe? Does the order matter? I could drop before the slicing without problem. In my specific case, I make several slices of both types. My hope would be that I could del df and memory management would do something like this under the hood.
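A minimal sketch of the pattern described above, on a small synthetic frame: take both selections, delete the original name, and only then call gc.collect(). The frame contents, the column names, and the helper name make_slices are made up for illustration. It relies on df.sample and df[list_of_columns] returning new objects rather than views, which is worth verifying on your pandas version.

```python
import gc

import numpy as np
import pandas as pd


def make_slices(n_rows=1000):
    # Hypothetical stand-in for the imported data set
    df = pd.DataFrame(np.random.rand(n_rows, 4), columns=["a", "b", "c", "d"])

    df_row_slice = df.sample(frac=0.6)
    df_column_slice = df[["a", "b"]]

    # Remove the only name bound to the original frame, then ask the
    # collector to clean up any reference cycles that might remain
    del df
    gc.collect()

    return df_row_slice, df_column_slice


df_row_slice, df_column_slice = make_slices()
print(df_row_slice.shape, df_column_slice.shape)
```

Both selections stay usable after the original name is gone. Whether the operating system reports less memory afterwards depends on the allocator, so measure rather than assume.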

M_Gorky · Accepted Answer · 2018-07-31 16:43:30Z

You can use df.drop to remove unused columns and rows.

import os, psutil, numpy as np, pandas as pd

def usage():
    # Resident set size (RSS) of the current process, in MiB
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / float(2 ** 20)

df_all = pd.read_csv('../../../Datasets/Trial.csv', index_col=None)
usage()

cols_to_drop = df_all.loc[:5,'Col3':].columns.values
df_all = df_all.drop(columns=cols_to_drop)
usage()

Here the first usage() call returns 357 and the second returns 202 for me (both in MiB).
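Process-level numbers like the ones above include everything the interpreter has allocated, and freed memory is not always handed back to the operating system right away. As a complementary check, here is a sketch that uses DataFrame.memory_usage to measure the frame itself before and after the drop. The synthetic frame and the column names Col1 to Col6 are assumptions standing in for Trial.csv, and frame_mib is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the CSV: 100,000 rows, six float64 columns
df_all = pd.DataFrame(
    np.random.rand(100_000, 6),
    columns=["Col1", "Col2", "Col3", "Col4", "Col5", "Col6"],
)


def frame_mib(frame):
    # Bytes held by the frame's index and columns, converted to MiB
    return frame.memory_usage(deep=True).sum() / 2 ** 20


before = frame_mib(df_all)

# Drop every column from 'Col3' onwards, as in the answer
cols_to_drop = df_all.loc[:, "Col3":].columns
df_all = df_all.drop(columns=cols_to_drop)

after = frame_mib(df_all)
print(f"{before:.2f} MiB -> {after:.2f} MiB")
```

This reports only what the frame holds, so it can shrink even when the process-level figure does not move much.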

If you need to have df_row_slice and df_column_slice at the same time, you can do this:

cols_to_drop = df_all.loc[:5,'Col3':].columns.values
# replace=False so that no row label is picked twice
rows_to_drop = np.random.choice(df_all.index.values, int(df_all.shape[0]*0.4), replace=False)
df_row_slice = df_all.drop(rows_to_drop)
df_all = df_all.drop(columns=cols_to_drop)
df_column_slice = df_all

Here df_column_slice is just another name for the same dataframe object, not a copy.
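To make the distinction concrete, here is a small sketch (with made-up data and column names) contrasting plain assignment with a column selection. It only checks object identity and column labels; it does not show whether the underlying buffers are shared.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["x", "y", "z"])

alias = df                  # second name for the same object
selection = df[["x", "y"]]  # separate DataFrame object

same_object = alias is df
separate_object = selection is not df

# drop() returns a new frame; rebinding the name df leaves the
# earlier selection (and the alias) untouched
df = df.drop(columns=["x", "y"])

print(same_object, separate_object)
print(list(selection.columns), list(alias.columns), list(df.columns))
```

Note that the alias keeps the full original frame alive, so memory for the dropped columns can only be reclaimed once every name bound to that object is gone.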

edited Jul 31, 2018 at 16:43

answered Jul 31, 2018 at 7:13


1 Comment

Keith Over a year ago

OK, that will get us part of the way. If I dropped all the columns from df would that remove them from df_column_slice? What if I dropped the whole dataframe?
