1

I have a pandas date-frame df with the index being a daily DatetimeIndex, and an attached column with historical_sales.

If we wanted to filter days in the past where historical_sales were greater than a large number, say 200, easy enough:

df.loc[df['historical_sales'>200]]

I'm wondering however, what if we wanted to explore sales patterns on 5 days preceding AND following days when sales were > 200?

Many thanks.

1
  • If want to filter for days around a list_of_dates: [ date1, date2, ...dateN), make sure to convert the list_of_dates = pd.to_datetime(list_of_dates) for the temp_index below to work. Else you get KeyError: '/' Commented Mar 21, 2018 at 9:06

4 Answers 4

1

I think youn need get all index values by list comprehension and then select by loc.

Also is necessary use numpy.concatenate for join all indexes together with numpy.unique for remove duplicates.

np.random.seed(100)
rng = pd.date_range('2017-04-03', periods=20)
df = pd.DataFrame({'historical_sales': np.random.choice([100,200,300], size=20)}, index=rng)
print (df)
            historical_sales
2017-04-03               100
2017-04-04               100
2017-04-05               100
2017-04-06               300
2017-04-07               300
2017-04-08               100
2017-04-09               300
2017-04-10               200
2017-04-11               300
2017-04-12               300
2017-04-13               300
2017-04-14               300
2017-04-15               200
2017-04-16               100
2017-04-17               100
2017-04-18               100
2017-04-19               100
2017-04-20               300
2017-04-21               100
2017-04-22               200

idxmask = df.index[df['historical_sales']>200]
print (idxmask)
DatetimeIndex(['2017-04-06', '2017-04-07', '2017-04-09', '2017-04-11',
               '2017-04-12', '2017-04-13', '2017-04-14', '2017-04-20'],
              dtype='datetime64[ns]', freq=None)

#in real data change 1 to 5 for 5 days
temp_index = [df.loc[timestamp - pd.Timedelta(1, unit='d') : 
                     timestamp + pd.Timedelta(1, unit='d')].index for timestamp in idxmask]
idx = np.unique(np.concatenate(temp_index))

df1 = df.loc[idx]
print (df1)
            historical_sales
2017-04-05               100
2017-04-06               300
2017-04-07               300
2017-04-08               100
2017-04-09               300
2017-04-10               200
2017-04-11               300
2017-04-12               300
2017-04-13               300
2017-04-14               300
2017-04-15               200
2017-04-19               100
2017-04-20               300
2017-04-21               100
Sign up to request clarification or add additional context in comments.

1 Comment

Worked like a wonder, many thanks! Only thing there was a typo i think: mask should be idxmask.
0

You would want to do range slicing: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position

Should look like this (code is pseudocode):

great_sales_df = df.loc[df['historical_sales'>200]]
for sale in great_sales_df:
   sales_date = great_sales_df["date"]
   sales_before = sales_date + pd.DateOffset(-5)
   sales_after =  sales_date + pd.DateOffset(+5)
   pattern_df = df.iloc[sales_before:sales_after]

This code will not work, but I think the direction is right.

Comments

0

When I need to work with rows before and after, I simply perform a shift.

df['preceeding_5th_day'] = df['historical_sales'].shift(5)
df['following_5th_day'] = df['historical_sales'].shift(-5)

Then, you can simply make your check and do

df.loc[df['historical_sales'>200]]

The selected rows then will also have columns about preceeding and following 5th day. Pretty straightforward this way.

Comments

0

For the sake of clarity, I am adding a new column which is set to 1 for the row of interest. And for the easy verification the number of window dates have been kept to 1 instead of 5, in the code below

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

df = pd.DataFrame(data=np.random.rand(51),index=pd.date_range('2015-04-20','2015-06-09'),columns=['A'])
idx = df[df.A >0.5].index

df["new"] = 0

for date in idx:
    current_date = date.to_pydatetime()
    start = current_date - timedelta(days=1)
    end = current_date + timedelta(days=1)

    df.loc[start:current_date]["new"] = 1
    df.loc[current_date:end]["new"] = 1


print(df)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.