2

I have a csv file that looks like this when I load it:

# generate example data
users = ['A', 'B', 'C', 'D']
#dates = pd.date_range("2020-02-01 00:00:00", "2020-04-04 20:00:00", freq="H")
dates = pd.date_range("2020-02-01 00:00:00", "2020-02-04 20:00:00", freq="H")
idx = pd.MultiIndex.from_product([users, dates])
idx.names = ["user", "datehour"]
y = pd.Series(np.random.choice(a=[0, 1], size=len(idx)), index=idx).rename('y')

# write to csv and reload (turns out this matters)
y.to_csv('reprod_example.csv')
y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y = y.set_index(['user', 'datehour']).y

>>> y.head()
user  datehour           
A     2020-02-01 00:00:00    0
      2020-02-01 01:00:00    0
      2020-02-01 02:00:00    1
      2020-02-01 03:00:00    0
      2020-02-01 04:00:00    0
Name: y, dtype: int64

I have the following function to create a lagged feature of an index level:

def shift_index(a, dt_idx_name, lag_freq, lag):

    # get datetime index of relevant level
    ac = a.copy()
    dti = ac.index.get_level_values(dt_idx_name)

    # shift it
    dti_shifted = dti.shift(lag, freq=lag_freq)

    # put it back where you found it
    ac.index.set_levels(dti_shifted, level=dt_idx_name, inplace=True)

    return ac

But when I run: y_lag = shift_index(y, 'datehour', 'H', 1), I get the following error:

ValueError: Level values must be unique...

(I can actually suppress this error by adding verify_integrity=False in .index.set_levels... in the function, but that (predictably) causes problems down the line)

Here's the weird part. If you run the example above but without saving/reloading from csv, it works. The reason seems to be, I think, that y.index.get_level_value('datehour') shows a freq='H' attribute right after it's created, but freq=None once its reloaded from csv.

That makes sense, csv obviously doesn't save that metadata. But I've found it surprisingly difficult to set the freq attribute for a MultiIndexed series. For example this did nothing. df.index.freq = pd.tseries.frequencies.to_offset("H"). And this answer also didn't work for my MultiIndex.

So I think I could solve this if I were able to set the freq attribute of the DateTime component of my MultiIndex. But my ultimate goal is to be create a version of my y data with shifted DateTime MultiIndex component, such as with my shift_index function above. Since I receive my data via csv, "just don't save to csv and reload" is not an option.

1
  • 1
    Great question. I have a similar issue, that the freq is set to None after an opertation - in my case, when I reindex with df.index = pd.MultiIndex.from_arrays([qhour.index, qhour.index.year]). I'm interested to see if there is a more... straightforward... way of doing things (no offence @mcskinner ;)) Commented Apr 20, 2020 at 15:05

2 Answers 2

2

After much fidgeting, I was able to set an hourly frequency using asfreq('H') on grouped data, such that each group has unique values for the datehour index.

y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y = y.groupby('user').apply(lambda df: df.set_index('datehour').asfreq('H')).y

Peeking at an index value shows the correct frequency.

y.index[0]                                                                                                                                                                                                                          
# ('A', Timestamp('2020-02-01 00:00:00', freq='H'))

All this is doing is setting the index in two parts. The user goes first so that the nested datehour index can be unique within it. Once the datehour index is unique, then asfreq can be used without difficulty.

If you try asfreq on a non-unique index, it will not work.

y_load.set_index('datehour').asfreq('H')
# ---------------------------------------------------------------------------
# ValueError                                Traceback (most recent call last)
# <ipython-input-433-3ba51b619417> in <module>
# ----> 1 y_load.set_index('datehour').asfreq('H')
# ...
# ValueError: cannot reindex from a duplicate axis
Sign up to request clarification or add additional context in comments.

3 Comments

Wow, setting an index inside a groupby/apply/lambda to set the freq, wild. After much fiddling indeed I am sure. I am super impressed that you figured that out, and I can confirm it absolutely works (+1). Hope you don't mind I'm gonna wait a day or two and see if anyone comes up with a simpler / more idiomatic way, which I gotta think must exist?
Totally fine. Glad this helped! And yeah, pretty wild. Pandas is great when it does what you want out of the box, but can get pretty awkward if it doesn't.
posted a question for the pandas devs here, we'll see if they have a simpler way. github.com/pandas-dev/pandas/issues/33647
0

inspired by the answer of @mcskinner, you can use .resample() in place of lambda function.

y = y.groupby('user').resample('H', level = 'datehour').y

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.