
I have a device that collects data. Two of the columns in this data (col1 and col2) hold 64-bit unsigned integer values. These values may overflow in some extreme cases, and I need to handle that, but only under certain conditions.

There are four columns: uptime, type, col1, col2. The conditions are checked on the uptime and type columns; the overflows are handled on col1 and col2.

uptime is the time in seconds since the device rebooted; col1 and col2 hold the values accumulated up to that time.

Example Data:

uptime        type        col1        col2
44            type0       980         561
104           type0       1422        902
164           type0       2304        1522
224           type1       690         623
284           type1       1603        1245
44            type1       752         698
104           type1       1304        1125

As you can see, when the type changes or the uptime decreases, the col1 and col2 values reset too. So I only need to handle the overflow when neither the type changes nor the uptime decreases.
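For anyone wanting to reproduce this, the example rows above can be loaded into a DataFrame like so (a hypothetical setup; the real data obviously comes from the device):

```python
import pandas as pd

# the example rows shown above
df = pd.DataFrame({
    'uptime': [44, 104, 164, 224, 284, 44, 104],
    'type':   ['type0', 'type0', 'type0', 'type1', 'type1', 'type1', 'type1'],
    'col1':   [980, 1422, 2304, 690, 1603, 752, 1304],
    'col2':   [561, 902, 1522, 623, 1245, 698, 1125],
})
print(df)
```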

I managed to do this with a loop by iterating through rows, as you can see below:

# object dtype stores Python ints, which have no upper bound
df[['col1', 'col2']] = df[['col1', 'col2']].fillna(0).astype(int).astype(object)
of_flag_col1 = False
of_flag_col2 = False
# start at the second row so df.loc[i - 1] always exists (assumes a default RangeIndex)
for i in range(1, len(df)):
    if of_flag_col1 or of_flag_col2:
        if of_flag_col1:
            if df.loc[i, 'uptime'] > df.loc[i - 1, 'uptime'] or df.loc[i, 'type'] == df.loc[i - 1, 'type']:
                df.loc[i, 'col1'] = df.loc[i, 'col1'] + 2**64
            else:
                of_flag_col1 = False
        if of_flag_col2:
            if df.loc[i, 'uptime'] > df.loc[i - 1, 'uptime'] or df.loc[i, 'type'] == df.loc[i - 1, 'type']:
                df.loc[i, 'col2'] = df.loc[i, 'col2'] + 2**64
            else:
                of_flag_col2 = False

    elif df.loc[i, 'uptime'] > df.loc[i - 1, 'uptime'] and df.loc[i, 'type'] == df.loc[i - 1, 'type']:
        if df.loc[i, 'col1'] < df.loc[i - 1, 'col1']:
            df.loc[i, 'col1'] = df.loc[i, 'col1'] + 2**64
            of_flag_col1 = True
        if df.loc[i, 'col2'] < df.loc[i - 1, 'col2']:
            df.loc[i, 'col2'] = df.loc[i, 'col2'] + 2**64
            of_flag_col2 = True

Integers in Python have no size limit, so I converted the columns to Python integers beforehand.

The conditions are basically:

  • if the value in uptime is higher than the previous value AND the value in type is the same as the previous value:
    • if the value in col1 is lower than the previous value:
      • add 2^64 to the value in col1
      • keep adding 2^64 on the following rows until neither condition in the first if holds (i.e. until the counters reset)
    • if the value in col2 is lower than the previous value:
      • add 2^64 to the value in col2
      • keep adding 2^64 on the following rows until neither condition in the first if holds (i.e. until the counters reset)
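The two row-wise conditions in the outer if can also be sketched as a vectorised boolean mask (a sketch using the example data from the question; the name `same_run` is hypothetical):

```python
import pandas as pd

# the example data from the question
df = pd.DataFrame({
    'uptime': [44, 104, 164, 224, 284, 44, 104],
    'type':   ['type0'] * 3 + ['type1'] * 4,
})

# True where the row continues the previous run:
# uptime increased AND type unchanged
same_run = df['uptime'].diff().gt(0) & df['type'].eq(df['type'].shift())
print(same_run.tolist())
# [False, True, True, False, True, False, True]
```

Rows 3 (type change) and 5 (uptime reset) break the run, so overflow handling only applies where the mask is True.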

I know that updating a Pandas DataFrame by iterating over its rows isn't healthy, but I couldn't manage to do it any other way. Is this possible without iterating through and updating rows?

  • Can you share some example data? Is the data sorted on the time column? Commented Aug 18, 2021 at 16:36
  • @Cimbali I will share in ten minutes. The data isn't sorted on the time column. The Time is actually the uptime and it can get down to zero if the device reboots. I will add more details. Commented Aug 18, 2021 at 16:48

1 Answer


So if I understand well, you’re saying that for any type value, col1 and col2 need to be monotonically increasing over time.

First we can construct a variable that’s True every time we go back in time or change type:

>>> df['time'].diff().lt(pd.Timedelta(0))
0      False
1      False
2      False
3      False
4      False
       ...  
115    False
116    False
117    False
118    False
119    False
Name: time, Length: 120, dtype: bool

By doing a cumulative sum on these booleans, we can define one group per contiguous run of rows with increasing time and the same type:

>>> ordered_sequence = df['time'].diff().lt(pd.Timedelta(0)).cumsum()
>>> type_sequence = df['type'].ne(df['type'].shift().fillna(df['type'])).cumsum()
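As a quick check, here is what these two sequences look like on the question's own example rows (a sketch: the integer uptime column takes the place of time, so the diff is compared with a plain 0 instead of pd.Timedelta(0)):

```python
import pandas as pd

df = pd.DataFrame({
    'uptime': [44, 104, 164, 224, 284, 44, 104],
    'type':   ['type0'] * 3 + ['type1'] * 4,
})

# increments at each uptime reset
ordered_sequence = df['uptime'].diff().lt(0).cumsum()
# increments at each type change
type_sequence = df['type'].ne(df['type'].shift().fillna(df['type'])).cumsum()

print(list(zip(ordered_sequence.tolist(), type_sequence.tolist())))
# [(0, 0), (0, 0), (0, 0), (0, 1), (0, 1), (1, 1), (1, 1)] -- three distinct groups
```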

Now within each group we want to add 2^64 every time values decrease. Again, we can use this cumsum technique, here with transform:

>>> df.groupby([ordered_sequence, type_sequence])[['col1', 'col2']].transform(lambda s: s + s.diff().lt(0).cumsum() * 2 ** 64)

Here’s an example with smaller numbers (range 0..127), using 2^6 instead of 2^64, to demonstrate:

>>> df
                   time  col1  col2 type
0   2021-08-14 11:14:00    35    68    a
1   2021-08-14 11:15:00    70    31    a
2   2021-08-14 11:16:00    80   100    b
3   2021-08-14 11:17:00   117   119    c
4   2021-08-14 11:18:00    21    89    c
..                  ...   ...   ...  ...
115 2021-08-14 13:09:00   114    49    c
116 2021-08-14 13:10:00    45    91    a
117 2021-08-14 13:11:00    36   115    b
118 2021-08-14 13:12:00    66    14    a
119 2021-08-14 13:13:00    58    72    a

[120 rows x 4 columns]
>>> df.groupby([ordered_sequence, type_sequence])[['col1', 'col2']].transform(lambda s: s + s.diff().lt(0).cumsum() * 2 ** 6)
     col1  col2
0      35    68
1      70    95
2      80   100
3     117   119
4      85   153
..    ...   ...
115   114    49
116    45    91
117    36   115
118    66    14
119   122    72

[120 rows x 2 columns]
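Putting it together on the question's integer uptime data (a sketch; the last row with col1 = 12 is fabricated to simulate an actual wrap-around, and the columns are cast to object so the additions use Python ints and cannot overflow int64):

```python
import pandas as pd

# hypothetical reconstruction of the question's data, plus one extra row
# (uptime 164, type1) where col1 has wrapped around past 2**64
df = pd.DataFrame({
    'uptime': [44, 104, 164, 224, 284, 44, 104, 164],
    'type':   ['type0'] * 3 + ['type1'] * 5,
    'col1':   [980, 1422, 2304, 690, 1603, 752, 1304, 12],
    'col2':   [561, 902, 1522, 623, 1245, 698, 1125, 1500],
})

# uptime is an integer here, so compare the diff with 0 instead of pd.Timedelta(0)
ordered_sequence = df['uptime'].diff().lt(0).cumsum()
type_sequence = df['type'].ne(df['type'].shift().fillna(df['type'])).cumsum()

# object dtype keeps the arithmetic in arbitrary-precision Python ints
fixed = (df.groupby([ordered_sequence, type_sequence])[['col1', 'col2']]
           .transform(lambda s: s.astype(object)
                      + s.diff().lt(0).cumsum().astype(object) * 2**64))

print(fixed.loc[7, 'col1'])  # 18446744073709551628, i.e. 12 + 2**64
print(fixed.loc[3, 'col1'])  # 690: the reset at the type change is left alone
```

The resets at the type change (row 3) and the uptime drop (row 5) start new groups, so only the genuine in-run decrease gets 2^64 added.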

1 Comment

Thank you! I also noticed a corruption in my data thanks to your solution.
