
I have a large pandas dataframe df of something like a million rows and 100 columns, and I have to create a second dataframe df_n of the same size. Several rows and columns of df_n will be equal to the same rows and columns of df. I have a mask m and a list of columns l where df_n differs from df, and I also have a dataframe df_small of the differences, such that df_n[m][l] = df_small.

I would like to say then that df_n[~m][~l] = df[~m][~l]. In order to save memory, I want to avoid creating any intermediate copies of df. It is probably a trivial problem, but I am struggling to achieve it. The final result has to be that df_n references df for [~m][~l], and new memory is occupied by only df_small. How can this be done?

  • I'm not quite confident enough to write an answer, but I don't think Pandas has what you're looking for. To the best of my knowledge, it makes no provision for distinct dataframes to share parts of the storage for their contents. It is possible in principle to roll something like that yourself, but I anticipate that that would be a fairly major project of its own. Commented Aug 8 at 20:42
  • Could you explain this more? "The final result has to be that df_n references df for [~m][~l], and new memory is occupied by only df_small" Do you just want df_n to take on the values of df for all [~m][~l]? I do not know of a way to do this without allocating memory to df_n, so is that okay? Commented Aug 9 at 2:21
  • Hi @Ev09, yes I want that but the new memory allocation has always to be for df_small only, even in the intermediate steps. Is this what you mean? Commented Aug 9 at 9:00

1 Answer


You can create df_n conditionally from df. Take the complements of m and l, then select only the rows and columns of df that fall in those complements:

# First, take complements of m and l
# (rowList and colList are the full row/column labels; see full code below)
mC = list(set(rowList) - set(m))
lC = list(set(colList) - set(l))

# Create df_n
df_n = df.loc[mC, lC]

You do not need to create any intermediate dataframes or reference df_small to create df_n, but df_n will still take memory, since we are copying the slice of the original df. If you want df_n to be a view of df[~m][~l], I would advise using numpy rather than pandas: https://numpy.org/doc/stable/user/basics.copies.html
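For reference, here is a minimal sketch of the view/copy distinction in NumPy (the array shape and indices are made up for illustration):

```python
import numpy as np

a = np.random.randn(1000, 100)

# Basic slicing returns a view: no data is copied
view = a[10:500, 5:50]
print(np.shares_memory(a, view))   # True

# Fancy indexing (with an arbitrary index list) returns a copy
rows = [0, 2, 4]
sub = a[rows, :]
print(np.shares_memory(a, sub))    # False
```

Note the catch for this question: mC and lC are arbitrary label lists, so selecting them is fancy indexing and produces a copy even in NumPy; only contiguous or strided slices yield true views.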

Full code:

# Import packages
import pandas as pd
import numpy as np
import random

# Setup example
rowSize = int(1e6)
colSize = 100

rowList = list(range(rowSize))
colList = list(range(colSize))

df = pd.DataFrame(np.random.randn(rowSize, colSize), columns=list(range(colSize)))
m = random.sample(rowList, 1000)
l = random.sample(colList, 10)

# First, take complements of m and l
mC = list(set(rowList) - set(m))
lC = list(set(colList) - set(l))

# Create df_n
df_n = df.loc[mC, lC]
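If the goal is to avoid duplicating the untouched part of df at all, one option worth checking is pandas' Copy-on-Write mode (opt-in from pandas 2.x, default from 3.0). A sketch, with a made-up small frame and indices for illustration:

```python
import numpy as np
import pandas as pd

# Copy-on-Write is the default from pandas 3.0; opt in on pandas 2.x
if pd.__version__ < "3":
    pd.options.mode.copy_on_write = True

df = pd.DataFrame(np.random.randn(1000, 10))

# Under CoW, a plain slice is lazy: df_n shares df's buffers for now
df_n = df.loc[:, :]

# Writing into df_n triggers a copy of the data being modified,
# while df itself is left untouched
df_n.loc[[1, 3], [2, 5]] = 99.0

print(df_n.loc[1, 2])   # 99.0
print(df.loc[1, 2])     # original random value, unchanged
```

How much is actually copied on write depends on pandas' internal block layout, so it is worth measuring with a realistic workload before relying on this.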