Combine Pandas DataFrame Rows by Timestamp and Column

Question

I converted below data into DataFrame which looks as below

data = [
       {"start_ts": "2018-05-14 10:54:33", "end_ts": "2018-05-14 11:54:33", "product": "a", "value": 1},
       {"start_ts": "2018-05-14 11:54:33", "end_ts": "2018-05-14 12:54:33", "product": "a", "value": 1}, 
       {"start_ts": "2018-05-14 13:54:33", "end_ts": "2018-05-14 14:54:33", "product": "a", "value": 1},          
       {"start_ts": "2018-05-14 10:54:33", "end_ts": "2018-05-14 11:54:33", "product": "b", "value": 1}
   ]

    product start_ts            end_ts              value
0   a       2018-05-14 10:54:33 2018-05-14 11:54:33 1
1   a       2018-05-14 11:54:33 2018-05-14 12:54:33 1
2   a       2018-05-14 13:54:33 2018-05-14 14:54:33 1
3   b       2018-05-14 10:54:33 2018-05-14 11:54:33 1

I'm trying to bucketize above DF rows into one row by finding contiguous timestamp fields (where start_ts is equal to the prior row's end_ts for a product) for a product and sum the value column like below.

Expected:

    product start_ts            end_ts              value
0   a       2018-05-14 10:54:33 2018-05-14 12:54:33 2
1   a       2018-05-14 13:54:33 2018-05-14 14:54:33 1
2   b       2018-05-14 10:54:33 2018-05-14 11:54:33 1

I'm unable to get the expected above value using the code below

def merge_dates(grp):
    date_groups = (grp['start_ts'] != grp['end_ts'].shift())
    return grp.groupby(date_groups).agg({'start_ts': 'first', 'end_ts': 'last'})   

df.groupby(["product"]).apply(merge_dates)

Need some advice. Any help would be greatly appreciated!

Thanks

@BradSolomon yes, "contiguous" means that start_ts is equal to the prior row's end_ts. Also, updated the condition in the description as well. Thank you — Sunil
– Sunil, Commented Apr 22, 2018 at 18:46

cmaher · Accepted Answer · 2018-04-22 19:40:01Z

3

I believe this will work:

df.groupby(['product', (df.start_ts != df.end_ts.shift()).cumsum()], \
           as_index=False).agg({'start_ts':min, 'end_ts':max, 'value':sum})

#   product              end_ts            start_ts  value
# 0       a 2018-05-14 12:54:33 2018-05-14 10:54:33      2
# 1       a 2018-05-14 14:54:33 2018-05-14 13:54:33      1
# 2       b 2018-05-14 11:54:33 2018-05-14 10:54:33      1

This approach groups by product and by a cumsum of the boolean series created by df.start_ts != df.end_ts.shift(); the boolean series serves as a counter that increases by one each time df.start_ts does not equal the previous row's end_ts (i.e. df.end_ts.shift()), thus indicating when a new group should start.

answered Apr 22, 2018 at 19:40

cmaher

5,2641 gold badge24 silver badges34 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

iDrwish Over a year ago

Very clever with the use of .cumsum to get a grouping element.

Collectives™ on Stack Overflow

Combine Pandas DataFrame Rows by Timestamp and Column

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related