import timeit
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 10))
dft = df[[True, False] * 5]
# df = dft
dft2 = dft.copy()
new_data = np.random.rand(5, 10)
print(timeit.timeit('dft.loc[:, :] = new_data', setup='from __main__ import dft, new_data', number=100))
print(timeit.timeit('dft2.loc[:, :] = new_data', setup='from __main__ import dft2, new_data', number=100))
On my laptop, setting values in dft (the original subset) is about 160 times slower than setting values in dft2 (a deep copy of dft).
Why is this the case?
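One way to see the difference between the two frames is to inspect the internal _is_copy attribute (a pandas implementation detail, so version-dependent and not part of the public API): on the versions where it exists, the boolean-mask subset holds a weak reference to its parent, while the explicit copy holds None. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 10))
dft = df[[True, False] * 5]   # boolean-mask subset of df
dft2 = dft.copy()             # explicit deep copy

# _is_copy is a pandas-internal attribute (version-dependent): a weak
# reference to the parent frame on a subset, and None on an explicit copy.
# getattr guards against versions where the attribute does not exist.
print(getattr(dft, "_is_copy", None))
print(getattr(dft2, "_is_copy", None))
```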
Edit: Removed speculation about proxy objects.
As c. leather suggests, this is likely because a different code path is taken when setting values on a frame that may be a copy of another (dft) vs. an independent DataFrame (dft2).
Bonus question: removing the last reference to the original DataFrame df (by uncommenting the df = dft line) cuts the slowdown factor to roughly 2 on my laptop. Any idea why that is?
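A conjecture for the bonus question (assuming the internal _is_copy attribute described above): the subset only holds a *weak* reference to its parent, so once the last strong reference to df is dropped, the parent can be garbage-collected, the weakref goes dead, and the copy check during assignment has much less work to do. A sketch of the dead-weakref effect:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 10))
dft = df[[True, False] * 5]

# _is_copy is pandas-internal and version-dependent; getattr guards
# against versions where it is absent.
ref = getattr(dft, "_is_copy", None)
del df  # drop the last strong reference to the parent frame

# On CPython, the parent is collected immediately, so a live weakref
# now resolves to None; if the attribute was absent, report True too.
print(ref() is None if ref is not None else True)
```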
df[[True, False] * 5] calls DataFrame.__getitem__(), which calls DataFrame._getitem_array() when the indexer is a list. That in turn calls DataFrame.take(), which has an is_copy parameter. I've found that df.take([0, 2, 4, 6, 8], is_copy=True) is slower than df.take([0, 2, 4, 6, 8], is_copy=False), with is_copy=True producing a runtime equal to dft in your example, and is_copy=False producing a runtime equal to dft2. So the slowdown arises somewhere down the line because of this is_copy flag, perhaps during DataFrame.__setitem__. I think your feeling about the returned object being a view/proxy is a good one, and I think it has to do with this flag.
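Note that the is_copy keyword on take() was deprecated and later removed in newer pandas versions, so on a recent install the same comparison can be sketched by detaching the subset with .copy() instead (a rough equivalent of the experiment above, not an exact reproduction):

```python
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 10))
new_data = np.random.rand(5, 10)

subset = df.take([0, 2, 4, 6, 8])           # may stay linked to df
detached = df.take([0, 2, 4, 6, 8]).copy()  # .copy() drops the parent link

def assign(frame):
    # Same assignment as in the question; may emit a
    # SettingWithCopyWarning for the linked subset on older pandas.
    frame.loc[:, :] = new_data

t_subset = timeit.timeit(lambda: assign(subset), number=100)
t_detached = timeit.timeit(lambda: assign(detached), number=100)
print(t_subset, t_detached)
```

On versions that still run the copy check, t_subset should come out noticeably larger; with copy-on-write enabled the gap largely disappears.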