9

I am looking to group by two columns: user_id and date; however, if the dates are close enough, I want to be able to consider the two entries part of the same group and group accordingly. Date is m-d-y

user_id     date       val
1           1-1-17     1
2           1-1-17     1
3           1-1-17     1
1           1-1-17     1
1           1-2-17     1
2           1-2-17     1
2           1-10-17    1
3           2-1-17     1

The grouping would group by user_id and dates +/- 3 days from each other. so the group by summing val would look like:

user_id     date       sum(val)
1           1-2-17     3
2           1-2-17     2
2           1-10-17    1
3           1-1-17     1
3           2-1-17     1

Any way someone could think of that this could be done (somewhat) easily? I know there are some problematic aspects of this. for example, what to do if the dates string together endlessly with three days apart. but the exact data im using only has 2 values per person..

Thanks!

2 Answers 2

19

I'd convert this to a datetime column and then use pd.TimeGrouper:

dates =  pd.to_datetime(df.date, format='%m-%d-%y')
print(dates)
0   2017-01-01
1   2017-01-01
2   2017-01-01
3   2017-01-01
4   2017-01-02
5   2017-01-02
6   2017-01-10
7   2017-02-01
Name: date, dtype: datetime64[ns]

df = (df.assign(date=dates).set_index('date')
        .groupby(['user_id', pd.TimeGrouper('3D')])
        .sum()
        .reset_index())    
print(df)
   user_id       date  val
0        1 2017-01-01    3
1        2 2017-01-01    2
2        2 2017-01-10    1
3        3 2017-01-01    1
4        3 2017-01-31    1

Similar solution using pd.Grouper:

df = (df.assign(date=dates)
        .groupby(['user_id', pd.Grouper(key='date', freq='3D')])
        .sum()
        .reset_index())
print(df)
   user_id       date  val
0        1 2017-01-01    3
1        2 2017-01-01    2
2        2 2017-01-10    1
3        3 2017-01-01    1
4        3 2017-01-31    1

Update: TimeGrouper will be deprecated in future versions of pandas, so Grouper would be preferred in this scenario (thanks for the heads up, Vaishali!).

Sign up to request clarification or add additional context in comments.

7 Comments

I always afraid to touch any time related question ... LOL btw +1
Amazing, never used grouper somehow
Grouper is TimeGrouper
Thanks both :) @Wen, yeah I used to run away from date problems as well. Also, yeah, you're right, the only difference being TimeGrouper needs the index to be a datetime index.
@Wen It was my first choice, but the datetime column seems to disappear... uff... I didn't like reset_index either but no choice..
|
2

I come with a very ugly solution but still work...

df=df.sort_values(['user_id','date'])
df['Key']=df.sort_values(['user_id','date']).groupby('user_id')['date'].diff().dt.days.lt(3).ne(True).cumsum()
df.groupby(['user_id','Key'],as_index=False).agg({'val':'sum','date':'first'})

Out[586]: 
   user_id  Key  val       date
0        1    1    3 2017-01-01
1        2    2    2 2017-01-01
2        2    3    1 2017-01-10
3        3    4    1 2017-01-01
4        3    5    1 2017-02-01

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.