
I'm confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.

If I have, for example,

df = pd.DataFrame(np.random.randn(8,8), columns=list('ABCDEFGH'), index=range(1,9))

I understand that a query returns a copy so that something like

foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40

will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as

df.iloc[3] = 70

or

df.loc[1,'B':'E'] = 222

will change df. But I'm lost when it comes to more complicated cases. For example,

df[df.C <= df.B] = 7654321

changes df, but

df[df.C <= df.B].loc[:,'B':'E']

does not.

Is there a simple rule that Pandas is using that I'm just missing? What's going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I'm attempting to do in the last example above)?


Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I've also read through the "Related" questions on this topic, but I'm still missing the simple rule Pandas is using, and how I'd apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.

2 Answers


Here are the rules; later ones override earlier ones:

  • All operations generate a copy

  • If inplace=True is provided, it will modify in-place; only some operations support this

  • An indexer that sets, e.g. .loc/.iloc/.iat/.at, will set in place.

  • An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example from above is for .query; this will always return a copy, as it's evaluated by numexpr.)

  • An indexer that gets on a multiple-dtyped object is always a copy.
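A minimal sketch of the rules that can be checked without depending on view-versus-copy behaviour, which varies between pandas versions. The frame and the names foo and bar are made up for illustration; warnings are silenced because some pandas versions emit a SettingWithCopyWarning when writing to a query result.

```python
import warnings

import numpy as np
import pandas as pd

# Small illustrative frame: values 0..15, index labels 1..4
df = pd.DataFrame(np.arange(16.0).reshape(4, 4),
                  columns=list("ABCD"), index=range(1, 5))

# A setting indexer writes into the original frame.
df.loc[2, "B"] = -1.0

# .query returns a copy, so writing to the result leaves df alone.
foo = df.query("2 < index <= 4")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # some versions warn when setting on a derived frame
    foo.loc[:, "D"] = 40.0

# When independence matters, an explicit .copy() removes any doubt.
bar = df.iloc[:2].copy()
bar.iloc[0, 0] = 999.0

print(df)   # B at label 2 is -1.0; column D and the top-left cell are unchanged
print(foo)  # column D is 40.0 here only
```

Whether a plain getter such as df.iloc[:2] shares memory with df is exactly the part that is not reliable, which is why the sketch avoids asserting anything about it.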

Your example of chained indexing

df[df.C <= df.B].loc[:,'B':'E']

is not guaranteed to work (and thus you should never do this).

Instead do:

df.loc[df.C <= df.B, 'B':'E']

as this is faster and will always work.
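To answer the "how do I change the values" part directly, the same single .loc call can be used as the target of an assignment. This is a small sketch with made-up data; the column names and the value -1 are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [5, 1, 7, 2],
    "C": [3, 4, 9, 1],
    "D": [0, 0, 0, 0],
})

mask = df["C"] <= df["B"]      # rows 0 and 3 satisfy C <= B

# One indexing operation: row selection and column slice together.
# Label slices with .loc include both endpoints, so this touches B and C.
df.loc[mask, "B":"C"] = -1

print(df)
```

Rows that do not match the mask, and columns outside the slice, are left as they were.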

Chained indexing is two separate Python operations and thus cannot be reliably intercepted by pandas (you will often get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
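A sketch of why the chained form fails as an assignment target, using made-up data. The first operation builds a temporary frame (for a boolean mask this is a copy), and the second operation writes into that temporary, which is then thrown away. Warnings are silenced here because, depending on the pandas version, this line may emit a chained-assignment warning.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"B": [5, 1, 7, 2], "C": [3, 4, 9, 1]})
before = df.copy()
mask = df["C"] <= df["B"]

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas may warn about chained assignment
    # Step 1: df[mask] -> __getitem__, returns a temporary frame
    # Step 2: .loc[...] = 0 -> __setitem__ on that temporary, not on df
    df[mask].loc[:, "B":"C"] = 0

chained_changed_df = not df.equals(before)
print(chained_changed_df)  # False

# The single-call form sets on df itself.
df.loc[mask, "B":"C"] = 0
print(df.equals(before))   # False
```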


Comments

.query will ALWAYS return a copy (and not a view) because of what it's doing: it's evaluated by numexpr. So I'll add that to the 'rules'.
pandas relies on numpy to determine whether a view is generated. In the single-dtype case (which could be 1-d for a Series, 2-d for a frame, etc.), numpy may generate a view; it depends on what you are slicing, so sometimes you get a view and sometimes you don't. pandas doesn't rely on this at all, as it's not always obvious whether a view is generated. That doesn't matter for loc, which doesn't depend on it when setting. When chain indexing, however, it is very important (which is why chain indexing is bad).
Many thanks Jeff, your reply is most useful. What is your source/reference on this topic?
Then first, thanks for your great work! And second, if you have enough time I think it would be great to add a paragraph similar to your main reply in the doc.
Would certainly take a pull request to add/revise the docs. Go for it.
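The numpy behaviour mentioned in the comments can be seen on a plain array, independent of pandas. This sketch only shows numpy's own rule (basic slicing gives a view; integer-list and boolean indexing give copies); how pandas maps a given DataFrame operation onto these is not something it claims to show.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

basic = arr[:, 1:3]            # basic slicing
fancy = arr[:, [1, 2]]         # integer-list (advanced) indexing
masked = arr[arr[:, 0] > 0]    # boolean (advanced) indexing

print(np.shares_memory(arr, basic))   # True
print(np.shares_memory(arr, fancy))   # False
print(np.shares_memory(arr, masked))  # False

# Writing through the view changes the original array.
basic[0, 0] = -1
print(arr[0, 1])  # -1
```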

Since pandas 1.5.0, pandas has a Copy-on-Write (CoW) mode that makes any dataframe/Series derived from another behave like a copy, even when it is a view under the hood. When it is enabled, data is copied only if it is shared with another dataframe/Series at the moment one of them is modified. With CoW disabled, operations like slicing create a view (so changing the new dataframe unexpectedly changes the original), but with CoW the slice behaves like a copy.

import pandas as pd

pd.options.mode.copy_on_write = False   # disable CoW (this is the default as of pandas 2.0)
df = pd.DataFrame({'A': range(4), 'B': list('abcd')})

df1 = df.iloc[:4]                       # view
df1.iloc[0] = 100
df.equals(df1)                          # True <--- df changes together with df1



pd.options.mode.copy_on_write = True    # enable CoW (this is planned to be the default by pandas 3.0)
df = pd.DataFrame({'A': range(4), 'B': list('abcd')})

df1 = df.iloc[:4]                       # copy because data is shared
df1.iloc[0] = 100
df.equals(df1)                          # False <--- df doesn't change when df1 changes

One consequence is that chained pandas operations can be faster with CoW. In the following example, in the first case (CoW disabled) every intermediate step creates a copy, while in the second case (CoW enabled) the intermediate steps work on views and a copy is made only when data is actually modified. The runtime difference below comes from that: in the second case, data is not copied unnecessarily.

df = pd.DataFrame({'A': range(1_000_000), 'B': range(1_000_000)})

%%timeit
with pd.option_context('mode.copy_on_write', False):  # disable CoW in a context manager
    df1 = df.add_prefix('col ').set_index('col A').rename_axis('index col').reset_index()
# 30.5 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit
with pd.option_context('mode.copy_on_write', True):   # enable CoW in a context manager
    df2 = df.add_prefix('col ').set_index('col A').rename_axis('index col').reset_index()
# 18 ms ± 513 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
