I converted the following data into a DataFrame, shown below it:

data = [
       {"start_ts": "2018-05-14 10:54:33", "end_ts": "2018-05-14 11:54:33", "product": "a", "value": 1},
       {"start_ts": "2018-05-14 11:54:33", "end_ts": "2018-05-14 12:54:33", "product": "a", "value": 1}, 
       {"start_ts": "2018-05-14 13:54:33", "end_ts": "2018-05-14 14:54:33", "product": "a", "value": 1},          
       {"start_ts": "2018-05-14 10:54:33", "end_ts": "2018-05-14 11:54:33", "product": "b", "value": 1}
   ]

    product start_ts            end_ts              value
0   a       2018-05-14 10:54:33 2018-05-14 11:54:33 1
1   a       2018-05-14 11:54:33 2018-05-14 12:54:33 1
2   a       2018-05-14 13:54:33 2018-05-14 14:54:33 1
3   b       2018-05-14 10:54:33 2018-05-14 11:54:33 1

I'm trying to collapse the rows of this DataFrame by finding contiguous timestamps within each product (a row is contiguous when its start_ts equals the prior row's end_ts for the same product). Each contiguous run should become a single row, with the value column summed, like this:

Expected:

    product start_ts            end_ts              value
0   a       2018-05-14 10:54:33 2018-05-14 12:54:33 2
1   a       2018-05-14 13:54:33 2018-05-14 14:54:33 1
2   b       2018-05-14 10:54:33 2018-05-14 11:54:33 1

I can't get the expected output above with the code below:

def merge_dates(grp):
    date_groups = (grp['start_ts'] != grp['end_ts'].shift())
    return grp.groupby(date_groups).agg({'start_ts': 'first', 'end_ts': 'last'})   

df.groupby(["product"]).apply(merge_dates)

Need some advice. Any help would be greatly appreciated!

Thanks

  • @BradSolomon Yes, "contiguous" means that start_ts is equal to the prior row's end_ts. I've also updated the condition in the description. Thank you. Commented Apr 22, 2018 at 18:46
  • @BradSolomon I've updated the description section. Commented Apr 22, 2018 at 19:02

1 Answer

I believe this will work:

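# Group by product and by a run id that increments whenever a row's start_ts
# differs from the previous row's end_ts, then aggregate each run.
df.groupby(['product', (df.start_ts != df.end_ts.shift()).cumsum()],
           as_index=False).agg({'start_ts': 'min', 'end_ts': 'max', 'value': 'sum'})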

#   product              end_ts            start_ts  value
# 0       a 2018-05-14 12:54:33 2018-05-14 10:54:33      2
# 1       a 2018-05-14 14:54:33 2018-05-14 13:54:33      1
# 2       b 2018-05-14 11:54:33 2018-05-14 10:54:33      1

This approach groups by product and by the cumulative sum of the boolean series df.start_ts != df.end_ts.shift(). That boolean series is True wherever a row's start_ts does not equal the previous row's end_ts (i.e. df.end_ts.shift()). Its cumsum therefore acts as a counter that increases by one at each such row, marking where a new group should start.
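For anyone who wants to run this end to end, here is a minimal sketch of the same idea using the data from the question. It is a variation, not the exact one-liner above: it assumes the rows should first be sorted by product and start_ts, it converts the timestamp strings to datetimes, and it shifts end_ts within each product so that one product's last row is never compared with the next product's first row. The names prev_end, block and merged are just illustrative.

```python
import pandas as pd

data = [
    {"start_ts": "2018-05-14 10:54:33", "end_ts": "2018-05-14 11:54:33", "product": "a", "value": 1},
    {"start_ts": "2018-05-14 11:54:33", "end_ts": "2018-05-14 12:54:33", "product": "a", "value": 1},
    {"start_ts": "2018-05-14 13:54:33", "end_ts": "2018-05-14 14:54:33", "product": "a", "value": 1},
    {"start_ts": "2018-05-14 10:54:33", "end_ts": "2018-05-14 11:54:33", "product": "b", "value": 1},
]
df = pd.DataFrame(data)

# Assumption: compare real timestamps rather than strings.
df["start_ts"] = pd.to_datetime(df["start_ts"])
df["end_ts"] = pd.to_datetime(df["end_ts"])

# Assumption: contiguity is only meaningful on rows ordered by product and start time.
df = df.sort_values(["product", "start_ts"]).reset_index(drop=True)

# Previous row's end_ts within the same product (NaT for each product's first row).
prev_end = df.groupby("product")["end_ts"].shift()

# True where a new run starts; the running total gives every run its own id.
df["block"] = (df["start_ts"] != prev_end).cumsum()

# Collapse each (product, block) run into one row, then drop the helper column.
merged = (
    df.groupby(["product", "block"], as_index=False)
      .agg(start_ts=("start_ts", "min"),
           end_ts=("end_ts", "max"),
           value=("value", "sum"))
      .drop(columns="block")
)

print(merged)
```

Keeping the run id as an explicit column makes it easy to inspect which rows were treated as contiguous before they are aggregated.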

1 Comment

Very clever with the use of .cumsum to get a grouping element.
