Python pandas dataframes: filtering around key dates

Question

I have a pandas date-frame df with the index being a daily DatetimeIndex, and an attached column with historical_sales.

If we wanted to filter days in the past where historical_sales were greater than a large number, say 200, easy enough:

df.loc[df['historical_sales'>200]]

I'm wondering however, what if we wanted to explore sales patterns on 5 days preceding AND following days when sales were > 200?

Many thanks.

If want to filter for days around a list_of_dates: [ date1, date2, ...dateN), make sure to convert the list_of_dates = pd.to_datetime(list_of_dates) for the temp_index below to work. Else you get KeyError: '/' — Fed
– Fed, Commented Mar 21, 2018 at 9:06

jezrael · Accepted Answer · 2017-06-13 06:43:44Z

I think youn need get all index values by list comprehension and then select by loc.

Also is necessary use numpy.concatenate for join all indexes together with numpy.unique for remove duplicates.

np.random.seed(100)
rng = pd.date_range('2017-04-03', periods=20)
df = pd.DataFrame({'historical_sales': np.random.choice([100,200,300], size=20)}, index=rng)
print (df)
            historical_sales
2017-04-03               100
2017-04-04               100
2017-04-05               100
2017-04-06               300
2017-04-07               300
2017-04-08               100
2017-04-09               300
2017-04-10               200
2017-04-11               300
2017-04-12               300
2017-04-13               300
2017-04-14               300
2017-04-15               200
2017-04-16               100
2017-04-17               100
2017-04-18               100
2017-04-19               100
2017-04-20               300
2017-04-21               100
2017-04-22               200

idxmask = df.index[df['historical_sales']>200]
print (idxmask)
DatetimeIndex(['2017-04-06', '2017-04-07', '2017-04-09', '2017-04-11',
               '2017-04-12', '2017-04-13', '2017-04-14', '2017-04-20'],
              dtype='datetime64[ns]', freq=None)

#in real data change 1 to 5 for 5 days
temp_index = [df.loc[timestamp - pd.Timedelta(1, unit='d') : 
                     timestamp + pd.Timedelta(1, unit='d')].index for timestamp in idxmask]
idx = np.unique(np.concatenate(temp_index))

df1 = df.loc[idx]
print (df1)
            historical_sales
2017-04-05               100
2017-04-06               300
2017-04-07               300
2017-04-08               100
2017-04-09               300
2017-04-10               200
2017-04-11               300
2017-04-12               300
2017-04-13               300
2017-04-14               300
2017-04-15               200
2017-04-19               100
2017-04-20               300
2017-04-21               100

Worked like a wonder, many thanks! Only thing there was a typo i think: mask should be idxmask.

Christian Sauer · Accepted Answer · 2017-06-13 05:21:58Z

0

You would want to do range slicing: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position

Should look like this (code is pseudocode):

great_sales_df = df.loc[df['historical_sales'>200]]
for sale in great_sales_df:
   sales_date = great_sales_df["date"]
   sales_before = sales_date + pd.DateOffset(-5)
   sales_after =  sales_date + pd.DateOffset(+5)
   pattern_df = df.iloc[sales_before:sales_after]

This code will not work, but I think the direction is right.

answered Jun 13, 2017 at 5:21

Christian Sauer

11.2k12 gold badges57 silver badges88 bronze badges

Comments

Prakriti Gupta · Accepted Answer · 2017-06-13 06:06:04Z

0

When I need to work with rows before and after, I simply perform a shift.

df['preceeding_5th_day'] = df['historical_sales'].shift(5)
df['following_5th_day'] = df['historical_sales'].shift(-5)

Then, you can simply make your check and do

df.loc[df['historical_sales'>200]]

The selected rows then will also have columns about preceeding and following 5th day. Pretty straightforward this way.

answered Jun 13, 2017 at 6:06

Prakriti Gupta

1116 bronze badges

Comments

Abhishek · Accepted Answer · 2017-06-13 06:24:14Z

0

For the sake of clarity, I am adding a new column which is set to 1 for the row of interest. And for the easy verification the number of window dates have been kept to 1 instead of 5, in the code below

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

df = pd.DataFrame(data=np.random.rand(51),index=pd.date_range('2015-04-20','2015-06-09'),columns=['A'])
idx = df[df.A >0.5].index

df["new"] = 0

for date in idx:
    current_date = date.to_pydatetime()
    start = current_date - timedelta(days=1)
    end = current_date + timedelta(days=1)

    df.loc[start:current_date]["new"] = 1
    df.loc[current_date:end]["new"] = 1


print(df)

edited Jun 13, 2017 at 6:24

answered Jun 13, 2017 at 6:02

Abhishek

3,4174 gold badges36 silver badges51 bronze badges

Collectives™ on Stack Overflow

Python pandas dataframes: filtering around key dates

4 Answers 4

1 Comment

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

1 Comment

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related