I have a pandas DataFrame like this:

import pandas as pd

df = pd.DataFrame({
    'ts_diff':[0, 0, 738, 20, 29, 61, 42, 18, 62, 41, 42, 0, 0, 729, 43, 59, 42, 61, 44, 36, 61, 61, 42, 18, 62, 41, 42, 0, 0]
})

ts_diff is the duration in milliseconds between consecutive events.

I would like to generate another column, ts_diff_incr, based on ts_diff.

ts_diff_incr should be the elapsed time in milliseconds between the current row and the start of the event (the most recent row where ts_diff is zero).

The calculation logic is shown in the image below.

What is the best way to achieve this with vectorized operations instead of for loops?

[image: calculation logic for ts_diff_incr]

    df['ts_diff_incr'] = df.groupby(df['ts_diff'].eq(0).cumsum()).cumsum()
    Commented Nov 6 at 15:24

2 Answers

You can do this efficiently by applying cumsum() to a zero mask to build group ids, then using groupby().cumsum():

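import pandas as pd

# Start a new group id at every zero in ts_diff
df['group'] = (df['ts_diff'] == 0).cumsum()
# Running total of ts_diff within each group
df['ts_diff_incr'] = df.groupby('group')['ts_diff'].cumsum()

  1. (df['ts_diff'] == 0).cumsum()
    → increments the group id every time a zero appears, so each zero and the non-zero rows after it share one id.

  2. groupby('group')['ts_diff'].cumsum()
    → computes the cumulative time within each group, restarting at every zero.

Finally, drop the helper column "group":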

df = df.drop(columns='group')
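
The same idea can be sketched without the temporary column by passing the group ids straight to groupby. This is a minimal sketch: the function name add_ts_diff_incr is hypothetical, and the sample data is a shortened version of the frame from the question.

```python
import pandas as pd


def add_ts_diff_incr(df: pd.DataFrame, col: str = "ts_diff") -> pd.DataFrame:
    """Return a copy of df with a running total of `col` that restarts at every zero."""
    # Each zero bumps the counter, so a zero and the rows after it share one id.
    group_id = df[col].eq(0).cumsum()
    out = df.copy()
    # Cumulative sum within each group; no helper column is stored on the frame.
    out[f"{col}_incr"] = out.groupby(group_id)[col].cumsum()
    return out


# Shortened sample of the question's data (two events, each starting at a zero).
df = pd.DataFrame({"ts_diff": [0, 0, 738, 20, 29, 0, 729, 43]})
result = add_ts_diff_incr(df)
print(result)
```

Because the group ids live only in a local Series, there is nothing to drop afterwards and the input frame is left unchanged.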

This is called a cumulative sum.

In this case, it looks like you want to "reset" at each 0.

df.groupby((df['ts_diff'] == 0).cumsum()).cumsum()
    ts_diff
0         0
1         0
2       738
3       758
4       787
5       848
6       890
7       908
8       970
9      1011
10     1053
11        0
12        0
13      729
14      772
15      831
16      873
17      934
18      978
19     1014
20     1075
21     1136
22     1178
23     1196
24     1258
25     1299
26     1341
27        0
28        0

cumsum is first used to assign a group number to every row: the number increments at each zero, so a zero and the run of non-zero values that follows it share one number.

(df['ts_diff'] == 0).cumsum()
0     1
1     2
2     2
3     2
4     2
5     2
6     2
7     2
8     2
9     2
10    2
11    3
12    4
13    4
14    4
15    4
16    4
17    4
18    4
19    4
20    4
21    4
22    4
23    4
24    4
25    4
26    4
27    5
28    6
Name: ts_diff, dtype: int64

We then group by these numbers and calculate the cumsum within each group, which gives a running total that resets at each zero.
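
As an alternative to groupby, the same reset-at-zero total can be sketched in plain NumPy by subtracting the running total recorded at the most recent zero. This is a different technique from the one above, and it assumes every ts_diff value is non-negative (true for durations), so the overall running total never decreases. The sample data is a shortened version of the question's column.

```python
import numpy as np
import pandas as pd

# Shortened sample of the question's data.
df = pd.DataFrame({"ts_diff": [0, 0, 738, 20, 29, 0, 729, 43]})

x = df["ts_diff"].to_numpy()

# Running total over the whole column, ignoring resets.
total = np.cumsum(x)

# Running total as of the most recent zero. Taking a running maximum is only
# valid under the assumption that all values are >= 0, so `total` never decreases.
at_last_zero = np.maximum.accumulate(np.where(x == 0, total, 0))

# Elapsed time since the most recent zero.
df["ts_diff_incr"] = total - at_last_zero
print(df)
```

If negative values were possible, the running maximum would no longer track the most recent zero, and the groupby version would be the safer choice.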
