
I'm in a performance-critical field where we store our results in pandas DataFrames. The issue is that we do most of our computation in NumPy and then assign to pandas later, but this forces a copy on assignment: df['col'] = arr  # this will create a copy

My question: is there a pandas-friendly way to assign that won't break in the future? Is this feature in the pipeline? I currently found df._set_item_mgr('col', arr), but am wary of potential changes in the future.

I was thinking of making an issue on GH, but wanted to see what everyone thinks before submitting :)
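For reference, the copy on assignment can be demonstrated with np.shares_memory (a minimal sketch; the column name is illustrative):

```python
import numpy as np
import pandas as pd

arr = np.arange(5)
df = pd.DataFrame(index=range(5))

df["col"] = arr  # plain column assignment

# The DataFrame holds its own buffer rather than a view of arr,
# so the two do not share memory.
print(np.shares_memory(arr, df["col"].to_numpy()))
```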

3 Comments
  • I don't think so. You can avoid some consolidation overhead by defining multiple columns at once, but I don't think you can avoid copies entirely. Commented Jun 27 at 16:56
  • Is it enough for your needs to delete the array arr after you have made a copy? Use del arr or arr = whatever_else to delete arr and thus free up the memory. Maybe also use import gc and do gc.collect() after del arr. Commented Jun 27 at 16:56
  • @TimurShtatland it's not actually memory, it's timing. I'm running a lot of compute (a parameter grid search) where I expect up to 50% of the time is just moving memory around. Commented Jun 27 at 17:07

1 Answer


It's technically possible, but I'm not sure it meets all your requirements.

To the best of my knowledge, pandas doesn’t offer an assign operation without a deep copy. However, it does allow you to create a new DataFrame without making a deep copy.

import numpy as np
import pandas as pd

arr1 = np.arange(5)  # Result of computation #1.

# Create a new DataFrame without copying.
df = pd.DataFrame({"arr1": arr1}, copy=False)

# Verify it’s a shallow copy.
arr1[0] = 7
print(df)
   arr1
0     7
1     1
2     2
3     3
4     4

You can also create a new DataFrame from an existing one. Although this isn't in-place, it produces the same result as an assignment.

arr2 = np.ones(5, dtype=np.float32)  # Result of computation #2.

# Create a new DataFrame from the existing one without copying.
df = pd.DataFrame({**{c: df[c] for c in df}, "arr2": arr2}, copy=False)

# Verify it’s a shallow copy.
arr1[0] = 7
arr2[0] = 8
print(df)
   arr1  arr2
0     7   8.0
1     1   1.0
2     2   1.0
3     3   1.0
4     4   1.0

One important thing to note is that the copy=False option does NOT guarantee that no deep copy will be made. For example, if the data is a Python list rather than a NumPy array, pandas silently makes the copy anyway. So you need to take extra care to avoid passing such an object.
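A quick way to check whether the constructor actually avoided the copy is np.shares_memory. This sketch assumes pandas 2.x default settings; behavior under Copy-on-Write may differ:

```python
import numpy as np
import pandas as pd

arr = np.arange(5)
df = pd.DataFrame({"arr": arr}, copy=False)

# Shares memory: the block manager holds a reshaped view of arr.
print(np.shares_memory(arr, df["arr"].to_numpy()))

# A Python list is converted to an array (and therefore copied)
# regardless of copy=False.
df2 = pd.DataFrame({"lst": list(range(5))}, copy=False)
print(df2["lst"].tolist())
```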

Another option is to first create a column in the DataFrame and write the results directly there.

arr1 = np.arange(5)  # Result of computation #1.

# Create the output buffer in the data frame first.
df = pd.DataFrame({"arr1": arr1, "arr2": np.zeros_like(arr1, dtype=np.float32)})

# Use the out argument to directly write out the calculation results.
# Most numpy operations support the out argument.
np.power(arr1, 2, out=df["arr2"].array)
print(df)

Although this is not a direct answer to your question, it may offer better performance.
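To confirm that the out= write really lands in the frame's buffer, you can compare memory again. This sketch writes through Series.to_numpy(), which returns a view for a single-dtype numeric column on pandas 2.x without Copy-on-Write (with CoW enabled the buffer may be read-only):

```python
import numpy as np
import pandas as pd

arr1 = np.arange(5)
df = pd.DataFrame({"arr1": arr1, "arr2": np.zeros(5)})

buf = df["arr2"].to_numpy()  # view into the float64 block, no copy
np.power(arr1, 2, out=buf)   # write the squares directly into the frame

print(df["arr2"].tolist())   # reflects the in-place write
# The buffer and the column still share memory.
print(np.shares_memory(buf, df["arr2"].to_numpy()))
```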


2 Comments

Hey! Many thanks, this was my first approach actually. The only caveat is that this gets very slow for wide datasets with 50+ columns, since you have to reconstruct each column individually. But for very tall datasets (1M+ rows) it's definitely a good option :)
This is a helpful trick, and I've used it before to speed up real code. It's worth mentioning that later operations can trigger consolidation. If this happens, it will erase performance gains as all of those other columns get copied. uwekorn.com/2020/05/24/the-one-pandas-internal.html
