0

I have a dataframe of several thousand timeseries.

  • Each timeseries is identified by an integer

For each timeseries, there is a unique timestamp, so I can enforce the order.

  • For the example below I replace the timestamp with an integer idx

Each timeseries has a state column

  • For the example below the state is 0 or 1
  • But I calculate that and can use NaN or whatever makes it easier

I need to 'discard' all rows before there have been a certain number of consecutive 1's

  • It's 30, but for the example below I'm saying 2, to keep the example concise

So, here's some sample data...

test = pd.DataFrame({
    'group': [1,1,1,1,1,1,1, 2,2,2,2,2,2,2],
    'idx':   [0,1,2,3,4,5,6, 0,1,2,3,4,5,6],
    'value': [0,1,0,1,1,1,1, 0,1,1,1,0,1,0],
})

The results that I want are...

desired_result = pd.DataFrame({
    'group': [        1,1,1,     2,2,2,2,2],
    'idx':   [        4,5,6,     2,3,4,5,6],
    'value': [        1,1,1,     1,1,0,1,0],
})

What I think I need to calculate is...

test = pd.DataFrame({
    'group': [1,1,1,1,1,1,1, 2,2,2,2,2,2,2],
    'idx':   [0,1,2,3,4,5,6, 0,1,2,3,4,5,6],
    'value': [0,1,0,1,1,1,1, 0,1,1,1,0,1,0],
   #'consec':[0,1,0,1,2,3,4, 0,1,2,3,0,1,0], -- the cumulative sum of value, but resetting whenever a 0 is encountered
   #'max_c': [0,1,1,1,2,3,4, 0,1,2,3,3,3,3], -- the cumulative max of consec
   #                  ^ ^ ^      ^ ^ ^ ^ ^   -- rows I want to keep, as max_c >= 2
})

Then I can just take the rows where test[ test['max_c'] >= 2 ]


But, how do I calculate consec?

  • The cumulative sum of value, resetting at 0's, independently by group?



EDIT: My best attempt, but it feels ridiculously long winded...

test['cumsum'] = test.groupby(['group'])['value'].cumsum()

test['reset'] = test['cumsum'][ test.groupby(['group'])['value'].diff() == -1 ]
test['reset'] = test['reset'].fillna(0)

test['reset_cummax'] = test.groupby(['group'])['reset'].cummax()

test['consec'] = test['cumsum'] - test['reset_cummax']

test['c_max'] = test.groupby(['group'])['consec'].cummax()

1 Answer 1

2

IIUC, you can do the cumsum after a groupby on the column group and every time the column value is eq to 0, you create a new group with cumsum as well.

test['consec'] = test.groupby(['group', test['value'].eq(0).cumsum()])['value'].cumsum()
test['max_c'] = test.groupby(['group'])['consec'].cummax()
print(test)
    group  idx  value  consec  max_c
0       1    0      0       0      0
1       1    1      1       1      1
2       1    2      0       0      1
3       1    3      1       1      1
4       1    4      1       2      2
5       1    5      1       3      3
6       1    6      1       4      4
7       2    0      0       0      0
8       2    1      1       1      1
9       2    2      1       2      2
10      2    3      1       3      3
11      2    4      0       0      3
12      2    5      1       1      3
13      2    6      0       0      3
Sign up to request clarification or add additional context in comments.

2 Comments

It feels like the test['value'].eq(0).cumsum() should have its own .groupby(['group'])? Or, am I being overly paranoid, as even if it spans two groups, it won't matter anyway, due to it being part of a groupby that already includes group? (I think this works, Thank You, just shooting some other test data at it)
@MatBailie the group will restart for any new value in group as both group and test['value'].eq(0).cumsum() are in the groupby. So even if a group start with a value of 1, a new group hence a new cumsum will start there :)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.