I have the time series dataset below. I would like to split it into 20s bins, get the min and max timestamps in each bin, and add a flag to each bin indicating whether it contains at least 1 successful result (success: result = 0; failed: result = 1).

data = [{"product": "abc", "test_tstamp": 1530693399, "result": 1},
    {"product": "abc", "test_tstamp": 1530693405, "result": 0},
    {"product": "abc", "test_tstamp": 1530693410, "result": 1},
    {"product": "abc", "test_tstamp": 1530693411, "result": 0},
    {"product": "abc", "test_tstamp": 1530693415, "result": 0},
    {"product": "abc", "test_tstamp": 1530693420, "result": 1},
    {"product": "abc", "test_tstamp": 1530693430, "result": 1},
    {"product": "abc", "test_tstamp": 1530693431, "result": 1}]

I'm able to cut the data into 20s intervals using pandas.cut() and get the min and max timestamps for each bin:

import numpy as np
import pandas as pd
arange = np.arange(1530693398, 1530693440, 20)
data = [{"product": "abc", "test_tstamp": 1530693399, "result": 1},
    {"product": "abc", "test_tstamp": 1530693405, "result": 0},
    {"product": "abc", "test_tstamp": 1530693410, "result": 1},
    {"product": "abc", "test_tstamp": 1530693411, "result": 0},
    {"product": "abc", "test_tstamp": 1530693415, "result": 0},
    {"product": "abc", "test_tstamp": 1530693420, "result": 1},
    {"product": "abc", "test_tstamp": 1530693430, "result": 1},
    {"product": "abc", "test_tstamp": 1530693431, "result": 1}]
df = pd.DataFrame(data)
df['bins'] = pd.cut(df['test_tstamp'], arange)
output_1 = df.groupby(["bins"]).agg({'result': np.ma.count, 'test_tstamp': {'mindate': np.min, 'maxdate': np.max}})

                         test_tstamp               result
                         maxdate     mindate       count
bins                                                   
(1530693398, 1530693418]  1530693415  1530693399      5
(1530693418, 1530693438]  1530693431  1530693420      3

I'm also able to count the successful and failed results in each bin using groupby():

output_2 = df.groupby(["bins", "result"]).result.count()
                                     result
 bins                     result        
 (1530693398, 1530693418] 0            3
                          1            2
 (1530693418, 1530693438] 1            3

I'm not sure how to combine output_1 and output_2 so that, instead of the result count column above, each bin has result success, result failed, and flag columns.

Expected Output:

                         test_tstamp              result          flag
                             maxdate     mindate success failed
bins
(1530693398, 1530693418]  1530693415  1530693399       3      2   True
(1530693418, 1530693438]  1530693431  1530693420       0      3  False

Any pointers would help! Thank you!

1
  • Worked? Didn't work? Commented Jul 9, 2018 at 15:26

1 Answer

Unstack output_2 and then concatenate the two outputs:

output_2 = (
    output_2
       .unstack(fill_value=0)
       .rename(columns={0 : 'success', 1 : 'failed'}))

df = (pd.concat([output_1.test_tstamp, output_2], axis=1, keys=['test_tstamp', 'result'])
        .assign(flag=output_2.success.gt(0)))

                         test_tstamp              result          flag
result                       mindate     maxdate success failed       
bins                                                                  
(1530693398, 1530693418]  1530693399  1530693415       3      2   True
(1530693418, 1530693438]  1530693420  1530693431       0      3  False
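
As a possible alternative, the same summary can probably be built in a single groupby with named aggregation. The nested-dict renaming form of agg() used in the question was removed in pandas 1.0, so this may be the easier route on recent versions. This is only a sketch under a few assumptions: the helper columns success and failed and the names edges and summary are made up for illustration, and the result has flat column names rather than the two-level header shown above.

```python
import numpy as np
import pandas as pd

data = [
    {"product": "abc", "test_tstamp": 1530693399, "result": 1},
    {"product": "abc", "test_tstamp": 1530693405, "result": 0},
    {"product": "abc", "test_tstamp": 1530693410, "result": 1},
    {"product": "abc", "test_tstamp": 1530693411, "result": 0},
    {"product": "abc", "test_tstamp": 1530693415, "result": 0},
    {"product": "abc", "test_tstamp": 1530693420, "result": 1},
    {"product": "abc", "test_tstamp": 1530693430, "result": 1},
    {"product": "abc", "test_tstamp": 1530693431, "result": 1},
]
df = pd.DataFrame(data)

# Same 20 s bin edges as in the question.
edges = np.arange(1530693398, 1530693440, 20)
df["bins"] = pd.cut(df["test_tstamp"], edges)

# Hypothetical helper columns: 1 where the row is a success / a failure.
df["success"] = df["result"].eq(0).astype(int)
df["failed"] = df["result"].eq(1).astype(int)

# Named aggregation: output_name=(input_column, aggregation).
summary = df.groupby("bins", observed=False).agg(
    mindate=("test_tstamp", "min"),
    maxdate=("test_tstamp", "max"),
    success=("success", "sum"),
    failed=("failed", "sum"),
)

# A bin is flagged True when it has at least one successful result.
summary["flag"] = summary["success"].gt(0)

print(summary)
```

If the two-level header matters, the concat approach above keeps it; the flat version is mostly convenient when the summary is going to be filtered or merged afterwards.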