I have a dataframe of several thousand timeseries.
- Each timeseries is identified by an integer
Within each timeseries, the timestamps are unique, so I can enforce an ordering.
- For the example below I replace the timestamp with an integer idx
Each timeseries has a state column
- For the example below the state is 0 or 1
- But I calculate that myself, so I can use NaN or whatever makes it easier
I need to 'discard' all rows before there have been a certain number of consecutive 1's
- It's 30, but for the example below I'm saying 2, to keep the example concise
So, here's some sample data...
test = pd.DataFrame({
'group': [1,1,1,1,1,1,1, 2,2,2,2,2,2,2],
'idx': [0,1,2,3,4,5,6, 0,1,2,3,4,5,6],
'value': [0,1,0,1,1,1,1, 0,1,1,1,0,1,0],
})
The results that I want are...
desired_result = pd.DataFrame({
'group': [ 1,1,1, 2,2,2,2,2],
'idx': [ 4,5,6, 2,3,4,5,6],
'value': [ 1,1,1, 1,1,0,1,0],
})
What I think I need to calculate is...
test = pd.DataFrame({
'group': [1,1,1,1,1,1,1, 2,2,2,2,2,2,2],
'idx': [0,1,2,3,4,5,6, 0,1,2,3,4,5,6],
'value': [0,1,0,1,1,1,1, 0,1,1,1,0,1,0],
#'consec':[0,1,0,1,2,3,4, 0,1,2,3,0,1,0], -- the cumulative sum of value, but resetting whenever a 0 is encountered
#'max_c': [0,1,1,1,2,3,4, 0,1,2,3,3,3,3], -- the cumulative max of consec
# ^ ^ ^ ^ ^ ^ ^ ^ -- rows I want to keep, as max_c >= 2
})
Then I can just take the rows with test[test['max_c'] >= 2].
But, how do I calculate consec?
- i.e. the cumulative sum of value, resetting at every 0, independently by group?
EDIT: My best attempt, but it feels ridiculously long-winded...
# running total of value within each group
test['cumsum'] = test.groupby(['group'])['value'].cumsum()
# snapshot the running total at each 1 -> 0 transition
test['reset'] = test['cumsum'][test.groupby(['group'])['value'].diff() == -1]
test['reset'] = test['reset'].fillna(0)
# carry the most recent snapshot forward, per group
test['reset_cummax'] = test.groupby(['group'])['reset'].cummax()
# subtracting it restarts the count after every 0
test['consec'] = test['cumsum'] - test['reset_cummax']
test['max_c'] = test.groupby(['group'])['consec'].cummax()
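For comparison, here is a sketch of the shorter "block id" idiom for a reset-on-zero cumulative sum: number the runs by counting zeros cumulatively within each group, then cumsum value within each (group, block). The `block` key is an intermediate I'm introducing here, not a column in the original data.

```python
import pandas as pd

test = pd.DataFrame({
    'group': [1,1,1,1,1,1,1, 2,2,2,2,2,2,2],
    'idx':   [0,1,2,3,4,5,6, 0,1,2,3,4,5,6],
    'value': [0,1,0,1,1,1,1, 0,1,1,1,0,1,0],
})

# Every 0 starts a new block within its group; a cumulative count of
# zeros per group therefore gives a key that changes at each reset.
block = test['value'].eq(0).groupby(test['group']).cumsum()

# Cumulative sum within (group, block) restarts at every 0.
test['consec'] = test.groupby([test['group'], block])['value'].cumsum()
test['max_c'] = test.groupby('group')['consec'].cummax()

# Keep rows from the point where a run of length 2 has been seen.
result = test[test['max_c'] >= 2].drop(columns=['consec', 'max_c'])
```

With the sample data above, `result` matches `desired_result` (apart from the row index, which `reset_index(drop=True)` would fix).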