import timeit
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 10))

dft = df[[True, False] * 5]
# df = dft
dft2 = dft.copy()

new_data = np.random.rand(5, 10)

print(timeit.timeit('dft.loc[:, :] = new_data', setup='from __main__ import dft, new_data', number=100))
print(timeit.timeit('dft2.loc[:, :] = new_data', setup='from __main__ import dft2, new_data', number=100))

On my laptop, setting values in dft (the original subset) is about 160 times slower than setting values in dft2 (a deep copy of dft).

Why is this the case?

Edit: Removed speculation about proxy objects.

As c. leather suggests, this is likely because pandas takes a different code path when setting values on a frame it has flagged as a copy of another (dft) than on one it treats as an original DataFrame (dft2).

Bonus question: removing the reference to the original DataFrame df (by uncommenting the df = dft line) cuts the slowdown factor to roughly 2 on my laptop. Any idea why this is the case?

  • Under the hood, df[[True, False] * 5] calls DataFrame.__getitem__(), which calls DataFrame._getitem_array() when the indexer is a list. This in turn calls DataFrame.take(), which has an is_copy parameter. I've found that setting values on the result of df.take([0,2,4,6,8], is_copy=True) is slower than on the result of df.take([0,2,4,6,8], is_copy=False): is_copy=True gives the same runtime as dft in your example, and is_copy=False gives the same runtime as dft2. So the slowdown arises somewhere down the line because of this is_copy flag, perhaps during DataFrame.__setitem__. Commented Jul 8, 2016 at 0:50
  • What the is_copy property is actually used for, however, is pretty murky, and it will probably take some digging in __setitem__. I think your feeling about the returned array being a view/proxy is a good one, and I think it has to do with this property. Commented Jul 8, 2016 at 0:52
  • Thanks @c.leather. Wonder what those checks are. Commented Jul 13, 2016 at 16:42

1 Answer

This is not exactly a new question on SO. This and this are related posts, and this is the link to the current docs that explain it.

The comments from @c.leather are on the right track. The problem is that dft is flagged as derived from df, so pandas cannot be sure whether it is a view or an independent copy, as explained in the linked articles. Because it cannot tell whether the operation is safe, it runs a number of checks before performing the assignment. Those checks can be avoided by simply making a copy.
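To make the flag concrete, here is a minimal sketch that inspects the two frames from the question. It assumes the private attribute _is_copy, which is an internal detail of pandas and may be missing or behave differently depending on the version, so the sketch reads it with getattr and falls back to None. It only shows where the flag lives; it does not show the checks themselves.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 10))
dft = df[[True, False] * 5]   # subset taken with a boolean list
dft2 = dft.copy()             # explicit copy of the subset

# _is_copy is private and version dependent (assumption): on versions that
# have it, a subset may carry a weak reference to its parent frame here.
flag_subset = getattr(dft, "_is_copy", None)
flag_copy = getattr(dft2, "_is_copy", None)

print("dft  flag:", flag_subset)
print("dft2 flag:", flag_copy)
```

On versions that still track this flag, dft would be expected to report a reference to df, while the explicit copy reports None.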

This is a pertinent issue, and there is a whole discussion on GitHub. I've seen a lot of suggestions; the one I like the most is that the docs should encourage the df[[True,False] * 5].copy() idiom, which one may call the slice & copy idiom.
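A minimal sketch of the slice & copy idiom, reusing the shapes from the question. The names mask, subset and original are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 10))
mask = [True, False] * 5

# Slice and copy in one step, so the result is independent of df.
subset = df[mask].copy()

# Keep a snapshot of df to confirm that it is not modified below.
original = df.copy()

new_data = np.random.rand(5, 10)
subset.loc[:, :] = new_data

# Only the copy changed; the source frame is untouched.
print(np.allclose(subset.to_numpy(), new_data))
print(df.equals(original))
```

Copying right at the point of selection makes it explicit that later assignments are meant to affect only the subset.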

I could not find the exact checks, and on the GitHub issue this performance nuance is mentioned only in a few tweets that developers posted about the behavior. Maybe someone more involved in pandas development can add more input.

1 Comment

The question isn't about view vs. copy, it's about the reason for the speed difference. I think my speculation about proxy objects is misleading (and am striking it out). Thanks for the links to the github page!
