Grouping by date range with pandas

Question

I want to group by two columns, user_id and date. However, if two dates are close enough, I want to treat the entries as part of the same group and aggregate them together. Dates are in m-d-y format.

user_id     date       val
1           1-1-17     1
2           1-1-17     1
3           1-1-17     1
1           1-1-17     1
1           1-2-17     1
2           1-2-17     1
2           1-10-17    1
3           2-1-17     1

The grouping should combine rows that share a user_id and whose dates are within +/- 3 days of each other, so the group-by summing val would look like:

user_id     date       sum(val)
1           1-2-17     3
2           1-2-17     2
2           1-10-17    1
3           1-1-17     1
3           2-1-17     1

Is there a (somewhat) easy way to do this? I know there are some problematic aspects, for example what to do if the dates chain together endlessly, each within three days of the next. But the data I'm actually using only has 2 values per person.

Thanks!

cs95 · Accepted Answer · 2019-01-04 10:57:05Z

19

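I'd convert this to a datetime column and then group with a 3-day frequency using pd.Grouper (called pd.TimeGrouper in older pandas; that name was deprecated and then removed in pandas 1.0):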

import pandas as pd

dates = pd.to_datetime(df.date, format='%m-%d-%y')
print(dates)
0   2017-01-01
1   2017-01-01
2   2017-01-01
3   2017-01-01
4   2017-01-02
5   2017-01-02
6   2017-01-10
7   2017-02-01
Name: date, dtype: datetime64[ns]

# pd.TimeGrouper was removed in pandas 1.0; pd.Grouper(freq=...) replaces it.
df = (df.assign(date=dates).set_index('date')
        .groupby(['user_id', pd.Grouper(freq='3D')])
        .sum()
        .reset_index())
print(df)
   user_id       date  val
0        1 2017-01-01    3
1        2 2017-01-01    2
2        2 2017-01-10    1
3        3 2017-01-01    1
4        3 2017-01-31    1
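
For reference, here is a self-contained sketch of the same fixed-window idea that can be run as-is. It rebuilds the sample data from the question and assumes pd.Grouper's key argument is used to bin the date column directly instead of setting it as the index; the variable name binned is only illustrative.

```python
import pandas as pd

# Sample data from the question (dates are month-day-year).
df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 1, 2, 2, 3],
    "date": ["1-1-17", "1-1-17", "1-1-17", "1-1-17",
             "1-2-17", "1-2-17", "1-10-17", "2-1-17"],
    "val": [1] * 8,
})
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%y")

# key="date" tells pd.Grouper to bin a regular column, so no set_index is needed.
# Selecting "val" keeps the result a Series indexed by (user_id, date);
# reset_index() then turns both index levels back into columns.
binned = (
    df.groupby(["user_id", pd.Grouper(key="date", freq="3D")])["val"]
      .sum()
      .reset_index()
)
print(binned)
```

Note that these are fixed 3-day windows counted from the earliest date in the data, and the date shown is the start of each window, which is why 2-1-17 appears as 2017-01-31. Two dates only a day apart can still fall into different windows (for this data, 1-3-17 and 1-4-17 would), so this approximates the "+/- 3 days" requirement rather than implementing it exactly.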

7 Comments

BENY Over a year ago

I'm always afraid to touch any time-related question ... LOL. By the way, +1

Vaishali Over a year ago

Amazing, I somehow never used Grouper

BENY Over a year ago

Grouper is TimeGrouper

cs95 Over a year ago

Thanks both :) @Wen, yeah I used to run away from date problems as well. Also, yeah, you're right, the only difference being TimeGrouper needs the index to be a datetime index.

cs95 Over a year ago

@Wen It was my first choice, but the datetime column seems to disappear... uff... I didn't like reset_index either but no choice..

BENY · Accepted Answer · 2017-10-19 21:47:05Z

2

I came up with a very ugly solution, but it still works...

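# Assumes df['date'] is already datetime64 (e.g. via pd.to_datetime).
df = df.sort_values(['user_id', 'date'])
# Start a new Key whenever the gap to the user's previous row is not under 3 days.
# Each user's first row has a NaN gap, so it also starts a new Key. Use .le(3) to include exact 3-day gaps.
df['Key'] = df.groupby('user_id')['date'].diff().dt.days.lt(3).ne(True).cumsum()
df.groupby(['user_id', 'Key'], as_index=False).agg({'val': 'sum', 'date': 'first'})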

Out[586]: 
   user_id  Key  val       date
0        1    1    3 2017-01-01
1        2    2    2 2017-01-01
2        2    3    1 2017-01-10
3        3    4    1 2017-01-01
4        3    5    1 2017-02-01

