Shift DateTime index within a Pandas MultiIndex

Question

I have a csv file that looks like this when I load it:

# generate example data
users = ['A', 'B', 'C', 'D']
#dates = pd.date_range("2020-02-01 00:00:00", "2020-04-04 20:00:00", freq="H")
dates = pd.date_range("2020-02-01 00:00:00", "2020-02-04 20:00:00", freq="H")
idx = pd.MultiIndex.from_product([users, dates])
idx.names = ["user", "datehour"]
y = pd.Series(np.random.choice(a=[0, 1], size=len(idx)), index=idx).rename('y')

# write to csv and reload (turns out this matters)
y.to_csv('reprod_example.csv')
y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y = y.set_index(['user', 'datehour']).y

>>> y.head()
user  datehour           
A     2020-02-01 00:00:00    0
      2020-02-01 01:00:00    0
      2020-02-01 02:00:00    1
      2020-02-01 03:00:00    0
      2020-02-01 04:00:00    0
Name: y, dtype: int64

I have the following function to create a lagged feature of an index level:

def shift_index(a, dt_idx_name, lag_freq, lag):

    # get datetime index of relevant level
    ac = a.copy()
    dti = ac.index.get_level_values(dt_idx_name)

    # shift it
    dti_shifted = dti.shift(lag, freq=lag_freq)

    # put it back where you found it
    ac.index.set_levels(dti_shifted, level=dt_idx_name, inplace=True)

    return ac

But when I run: y_lag = shift_index(y, 'datehour', 'H', 1), I get the following error:

ValueError: Level values must be unique...

(I can actually suppress this error by adding verify_integrity=False in .index.set_levels... in the function, but that (predictably) causes problems down the line)

Here's the weird part. If you run the example above but without saving/reloading from csv, it works. The reason seems to be, I think, that y.index.get_level_value('datehour') shows a freq='H' attribute right after it's created, but freq=None once its reloaded from csv.

That makes sense, csv obviously doesn't save that metadata. But I've found it surprisingly difficult to set the freq attribute for a MultiIndexed series. For example this did nothing. df.index.freq = pd.tseries.frequencies.to_offset("H"). And this answer also didn't work for my MultiIndex.

So I think I could solve this if I were able to set the freq attribute of the DateTime component of my MultiIndex. But my ultimate goal is to be create a version of my y data with shifted DateTime MultiIndex component, such as with my shift_index function above. Since I receive my data via csv, "just don't save to csv and reload" is not an option.

Great question. I have a similar issue, that the freq is set to None after an opertation - in my case, when I reindex with df.index = pd.MultiIndex.from_arrays([qhour.index, qhour.index.year]). I'm interested to see if there is a more... straightforward... way of doing things (no offence @mcskinner ;)) — ElRudi
– ElRudi, Commented Apr 20, 2020 at 15:05

mcskinner · Accepted Answer · 2020-04-18 23:26:29Z

2

After much fidgeting, I was able to set an hourly frequency using asfreq('H') on grouped data, such that each group has unique values for the datehour index.

y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y = y.groupby('user').apply(lambda df: df.set_index('datehour').asfreq('H')).y

Peeking at an index value shows the correct frequency.

y.index[0]                                                                                                                                                                                                                          
# ('A', Timestamp('2020-02-01 00:00:00', freq='H'))

All this is doing is setting the index in two parts. The user goes first so that the nested datehour index can be unique within it. Once the datehour index is unique, then asfreq can be used without difficulty.

If you try asfreq on a non-unique index, it will not work.

y_load.set_index('datehour').asfreq('H')
# ---------------------------------------------------------------------------
# ValueError                                Traceback (most recent call last)
# <ipython-input-433-3ba51b619417> in <module>
# ----> 1 y_load.set_index('datehour').asfreq('H')
# ...
# ValueError: cannot reindex from a duplicate axis

edited Apr 18, 2020 at 23:26

answered Apr 18, 2020 at 23:20

mcskinner

2,7682 gold badges15 silver badges22 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Max Power Over a year ago

Wow, setting an index inside a groupby/apply/lambda to set the freq, wild. After much fiddling indeed I am sure. I am super impressed that you figured that out, and I can confirm it absolutely works (+1). Hope you don't mind I'm gonna wait a day or two and see if anyone comes up with a simpler / more idiomatic way, which I gotta think must exist?

mcskinner Over a year ago

Totally fine. Glad this helped! And yeah, pretty wild. Pandas is great when it does what you want out of the box, but can get pretty awkward if it doesn't.

Max Power Over a year ago

posted a question for the pandas devs here, we'll see if they have a simpler way. github.com/pandas-dev/pandas/issues/33647

ktp89 · Accepted Answer · 2025-03-14 21:02:52Z

0

inspired by the answer of @mcskinner, you can use .resample() in place of lambda function.

y = y.groupby('user').resample('H', level = 'datehour').y

edited Mar 14 at 21:02

answered Mar 14 at 21:02

ktp89

11 bronze badge

Collectives™ on Stack Overflow

Shift DateTime index within a Pandas MultiIndex

2 Answers 2

3 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related