8

Having this DataFrame:

import pandas

dates = pandas.date_range('2016-01-01', periods=5, freq='H')
s = pandas.Series([0, 1, 2, 3, 4], index=dates)
df = pandas.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar'])
df.set_index(['a', 'b'], inplace=True)

df


I would like to replace the Series in there with a new one that is simply the old one, but resampled to a day period (i.e. x.resample('D').sum().dropna()).

When I try:

df['foo'][0] = df['foo'][0].resample('D').sum().dropna()

That seems to work well.

However, I get a warning:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

The question is, how should I do this instead?

Notes

Things I have tried that do not work (with or without resampling, the assignment raises an exception):

df.iloc[0].loc['foo'] = df.iloc[0].loc['foo']
df.loc[(1, 2), 'foo'] = df.loc[(1, 2), 'foo']
df.loc[df.index[0], 'foo'] = df.loc[df.index[0], 'foo']

A bit more information about the data (in case it is relevant):

  • The real DataFrame has more columns in the multi-index, not all of them necessarily integers; more generally, they are numerical and categorical. The index is unique (i.e., there is only one row with a given index value).
  • The real DataFrame has, of course, many more rows in it (thousands).
  • There are not necessarily only two columns in the DataFrame, and more than one column may contain Series. Columns usually contain Series, categorical data and numerical data as well. Any single column is always single-typed (either numerical, or categorical, or Series).
  • The series contained in each cell usually have a variable length (i.e.: two series/cells in the DataFrame do not, unless pure coincidence, have the same length, and will probably never have the same index anyway, as dates vary as well between series).

Using Python 3.5.1 and Pandas 0.18.1.

  • Setting a Series inside a cell of a DataFrame makes me think you may want to use a pandas Panel. I'm no good with Panels myself, so I'll just leave this link here: pandas.pydata.org/pandas-docs/stable/generated/… Commented Jun 3, 2016 at 14:03
  • Can you provide a little more definition of your data? Will foo always be a Series? Will bar always be one number? Could you provide more rows? Commented Jun 10, 2016 at 3:47
  • @tmthydvnprt: I just updated my question with a bit more information about the data (see the "Notes" section). :-) Commented Jun 10, 2016 at 7:19
  • I am still a little unclear on how it looks. Could you make an example df with some of these corner cases? Commented Jun 10, 2016 at 11:19
  • @tmthydvnprt: I don't think it is worth it. Although I appreciate your help, I cannot really change how the data looks (it is not up to me to decide how those DataFrames are created/stored), and I think your answer goes in that direction. :-) So I think my problem can really be simplified as posted in the question (that is the exact problem and how it can be reproduced; the rest does not really matter). Commented Jun 10, 2016 at 11:31

3 Answers

3

This should work:

df.iat[0, df.columns.get_loc('foo')] = df['foo'][0].resample('D').sum().dropna()

Pandas complains about chained indexing, but when you avoid it, pandas has trouble assigning a whole Series to a single cell. With iat you can force that kind of assignment. I don't think it is a preferable thing to do, but it seems to be a working solution.
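
A minimal, self-contained sketch of this approach follows. It rebuilds the question's DataFrame and assumes the 'foo' column has object dtype, so that a cell can hold a whole Series. The hourly frequency is passed as a pd.Timedelta instead of the 'H' alias only so that the snippet does not depend on how frequency aliases are spelled in a given pandas version. The names col and daily are introduced here for illustration.

```python
import pandas as pd

# Rebuild the DataFrame from the question: one row whose 'foo' cell holds a Series.
dates = pd.date_range('2016-01-01', periods=5, freq=pd.Timedelta(hours=1))
s = pd.Series([0, 1, 2, 3, 4], index=dates)
df = pd.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar'])
df = df.set_index(['a', 'b'])

# Integer position of the 'foo' column (illustrative name, not from the question).
col = df.columns.get_loc('foo')

# Read the cell by position, resample it, and write the result back by position.
# A single iat access replaces the chained df['foo'][0] lookup.
daily = df.iat[0, col].resample('D').sum().dropna()
df.iat[0, col] = daily

print(df.iat[0, col])
```

Reading the cell with iat as well keeps both sides of the assignment free of chained indexing, which is what triggers the warning in the question.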


1

Simply set df.is_copy = False before assigning the new value.


1
+50

Hierarchical data in pandas

It really seems like you should consider restructuring your data to take advantage of pandas features such as MultiIndex and DatetimeIndex. This will allow you to still operate on an index in the typical way while being able to select on multiple columns across the hierarchical data (a, b, and bar).

Restructured Data

import pandas as pd

# Define Index
dates = pd.date_range('2016-01-01', periods=5, freq='H')
# Define Series
s = pd.Series([0, 1, 2, 3, 4], index=dates)

# Place Series in Hierarchical DataFrame
# (from_arrays expects one array-like per level, so each label is wrapped in a list)
heirIndex = pd.MultiIndex.from_arrays([[1], [2], [8]], names=['a', 'b', 'bar'])
df = pd.DataFrame(s, columns=heirIndex)

print(df)

a                    1
b                    2
bar                  8
2016-01-01 00:00:00  0
2016-01-01 01:00:00  1
2016-01-01 02:00:00  2
2016-01-01 03:00:00  3
2016-01-01 04:00:00  4
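
As a rough illustration of selecting on the hierarchical column levels mentioned above, the sketch below rebuilds the same one-column frame and picks columns by level value. The names by_bar and single are made up for this example, and the frame is built from s.values with an explicit index as a conservative choice rather than because the original construction is wrong.

```python
import pandas as pd

dates = pd.date_range('2016-01-01', periods=5, freq=pd.Timedelta(hours=1))
s = pd.Series([0, 1, 2, 3, 4], index=dates)

# One column labelled by the three metadata levels a, b and bar.
columns = pd.MultiIndex.from_arrays([[1], [2], [8]], names=['a', 'b', 'bar'])
df = pd.DataFrame(s.values, index=s.index, columns=columns)

# Cross-section on the columns: every column whose 'bar' level equals 8.
# The matched level is dropped, leaving 'a' and 'b' as column levels.
by_bar = df.xs(8, axis=1, level='bar')

# Select one column by its full (a, b, bar) label; this returns a Series.
single = df[(1, 2, 8)]

print(list(by_bar.columns.names))
print(single.sum())
```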

Resampling

With the data in this format, resampling becomes very simple.

# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna()

print(df_resampled)

a            1
b            2
bar          8
2016-01-01  10

Update (from data description)

If the data has variable-length Series, each with a different index, and non-numeric categories, that is OK. Let's make an example:

# Define Series
dates = pd.date_range('2016-01-01', periods=5, freq='H')
s = pd.Series([0, 1, 2, 3, 4], index=dates)

# Define a second Series with a different length and index
dates2 = pd.date_range('2016-01-14', periods=6, freq='H')
s2 = pd.Series([-200, 10, 24, 30, 40, 100], index=dates2)
# Define DataFrames (from_arrays expects one array-like per level)
df1 = pd.DataFrame(s, columns=pd.MultiIndex.from_arrays([[1], [2], [8], ['cat1']], names=['a', 'b', 'bar', 'c']))
df2 = pd.DataFrame(s2, columns=pd.MultiIndex.from_arrays([[2], [5], [5], ['cat3']], names=['a', 'b', 'bar', 'c']))

df = pd.concat([df1, df2])
print(df)

a                      1      2
b                      2      5
bar                    8      5
c                   cat1   cat3
2016-01-01 00:00:00  0.0    NaN
2016-01-01 01:00:00  1.0    NaN
2016-01-01 02:00:00  2.0    NaN
2016-01-01 03:00:00  3.0    NaN
2016-01-01 04:00:00  4.0    NaN
2016-01-14 00:00:00  NaN -200.0
2016-01-14 01:00:00  NaN   10.0
2016-01-14 02:00:00  NaN   24.0
2016-01-14 03:00:00  NaN   30.0
2016-01-14 04:00:00  NaN   40.0
2016-01-14 05:00:00  NaN  100.0

The only issue is that, after resampling, you will want to use how='all' when dropping NaN rows, like this:

# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna(how='all')

print(df_resampled)

a              1    2
b              2    5
bar            8    5
c           cat1 cat3
2016-01-01  10.0  NaN
2016-01-14   NaN  4.0
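
One caveat when reproducing this on pandas versions newer than the 0.18.1 used in the question: there, sum() over a bin with no values returns 0 by default rather than NaN, so dropna(how='all') would no longer remove the empty days between the two Series. The sketch below assumes such a newer version and passes min_count=1 so that empty bins stay NaN. The frames are built from .values with an explicit index, and the frequency is given as a pd.Timedelta, purely as conservative choices for this example.

```python
import pandas as pd

hour = pd.Timedelta(hours=1)
s = pd.Series([0, 1, 2, 3, 4],
              index=pd.date_range('2016-01-01', periods=5, freq=hour))
s2 = pd.Series([-200, 10, 24, 30, 40, 100],
               index=pd.date_range('2016-01-14', periods=6, freq=hour))

names = ['a', 'b', 'bar', 'c']
cols1 = pd.MultiIndex.from_arrays([[1], [2], [8], ['cat1']], names=names)
cols2 = pd.MultiIndex.from_arrays([[2], [5], [5], ['cat3']], names=names)
df1 = pd.DataFrame(s.values, index=s.index, columns=cols1)
df2 = pd.DataFrame(s2.values, index=s2.index, columns=cols2)

# Rows are the union of both indexes; each column is NaN where it has no data.
df = pd.concat([df1, df2])

# min_count=1 makes a daily bin NaN unless it holds at least one valid value,
# so days with no data in any column are dropped by dropna(how='all').
df_resampled = df.resample('D').sum(min_count=1).dropna(how='all')

print(df_resampled)
```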

3 Comments

As you may have noticed after the update (sorry for not providing this information before), this may not be possible, as the series contained in the DataFrame have a variable length and different index (and yes, there are many series contained in the DataFrame).
You can have variable lengths and disparate indexes! The DataFrame will just be NaN wherever they do not overlap.
Thanks for your answer. :-)
