I have a dataframe of several thousand timeseries.
- Each timeseries is identified by an integer
Within each timeseries, the timestamps are unique, so I can enforce an ordering.
- For the example below I replace the timestamp with an integer idx
Each timeseries has a state column
- For the example below the state is 0 or 1
- But I calculate that myself, so I can use NaN or whatever makes it easier
I need to 'discard' all rows before there have been a certain number of consecutive 1's
- It's 30, but for the example below I'm saying 2, to keep the example concise
So, here's some sample data...
test = pd.DataFrame({
'group': [1,1,1,1,1,1,1, 2,2,2,2,2,2,2],
'idx': [0,1,2,3,4,5,6, 0,1,2,3,4,5,6],
'value': [0,1,0,1,1,1,1, 0,1,1,1,0,1,0],
})
The results that I want are...
desired_result = pd.DataFrame({
'group': [ 1,1,1, 2,2,2,2,2],
'idx': [ 4,5,6, 2,3,4,5,6],
'value': [ 1,1,1, 1,1,0,1,0],
})
What I think I need to calculate is...
test = pd.DataFrame({
'group': [1,1,1,1,1,1,1, 2,2,2,2,2,2,2],
'idx': [0,1,2,3,4,5,6, 0,1,2,3,4,5,6],
'value': [0,1,0,1,1,1,1, 0,1,1,1,0,1,0],
#'consec':[0,1,0,1,2,3,4, 0,1,2,3,0,1,0], -- the cumulative sum of value, but resetting whenever a 0 is encountered
#'max_c': [0,1,1,1,2,3,4, 0,1,2,3,3,3,3], -- the cumulative max of consec
# ^ ^ ^ ^ ^ ^ ^ ^ -- rows I want to keep, as max_c >= 2
})
Then I can just take the rows with test[test['max_c'] >= 2].
But, how do I calculate consec?
- i.e. the cumulative sum of value, resetting at every 0, independently by group?
EDIT: My best attempt, but it feels ridiculously long-winded...
# running total of value within each group
test['cumsum'] = test.groupby(['group'])['value'].cumsum()
# snapshot the running total at each 1 -> 0 transition
test['reset'] = test['cumsum'][test.groupby(['group'])['value'].diff() == -1]
test['reset'] = test['reset'].fillna(0)
# carry the most recent snapshot forward, per group
test['reset_cummax'] = test.groupby(['group'])['reset'].cummax()
# subtracting it restarts the count after every 0
test['consec'] = test['cumsum'] - test['reset_cummax']
test['max_c'] = test.groupby(['group'])['consec'].cummax()
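For comparison, here is a sketch of the shorter "block id" idiom for a reset-on-zero cumulative sum: number the runs by counting zeros cumulatively within each group, then cumsum value within each (group, block). The `block` key is an intermediate I'm introducing here, not a column in the original data.

```python
import pandas as pd

test = pd.DataFrame({
    'group': [1,1,1,1,1,1,1, 2,2,2,2,2,2,2],
    'idx':   [0,1,2,3,4,5,6, 0,1,2,3,4,5,6],
    'value': [0,1,0,1,1,1,1, 0,1,1,1,0,1,0],
})

# Every 0 starts a new block within its group; a cumulative count of
# zeros per group therefore gives a key that changes at each reset.
block = test['value'].eq(0).groupby(test['group']).cumsum()

# Cumulative sum within (group, block) restarts at every 0.
test['consec'] = test.groupby([test['group'], block])['value'].cumsum()
test['max_c'] = test.groupby('group')['consec'].cummax()

# Keep rows from the point where a run of length 2 has been seen.
result = test[test['max_c'] >= 2].drop(columns=['consec', 'max_c'])
```

With the sample data above, `result` matches `desired_result` (apart from the row index, which `reset_index(drop=True)` would fix).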