Pandas: Add new dataframe column based on the dates of other smaller dataframe

Question

I have a dataframe that looks like this (link to csv):

time  ,  value
 0    ,   10
 1    ,   20
 2    ,   35
 3    ,   30
 4    ,   40
 5    ,   40
 6    ,   60

And I want to fill another column recentActive based on the values from this smaller dataframe (link to csv):

time  ,  value , activatedTime , deactivatedTime
 1    ,   20   ,      1        ,       5
 3    ,   30   ,      3        ,       4

In the recentActive column we should have the most recently activated value that has not been deactivated yet. Once a value is deactivated, the column should fall back to the previous value that is still active. The final dataframe should look like this:

time  ,  value  ,  recentActive
 0    ,   10    ,      NaN
 1    ,   20    ,      20   (t=1 activated)
 2    ,   35    ,      20
 3    ,   30    ,      30   (t=3 activated)
 4    ,   40    ,      30   (t=3 deactivated)
 5    ,   40    ,      20   (t=1 deactivated)
 6    ,   60    ,      NaN  (no active values)

How can I do this? Preferably just using vectorized operations, thanks!

The bigger one will have around 15000 rows and the smaller one around 500.
– mwind, Commented Dec 13, 2022 at 10:23
See the suggestion below. It might not be bullet-proof, so don't hesitate to provide feedback with an example if you have cases for which it doesn't work.
– mozway, Commented Dec 13, 2022 at 15:54

mozway · Accepted Answer · 2022-12-13 16:02:06Z

It's a bit complex to achieve if you want a performant solution.

You can build an IntervalIndex that includes a "catch-all" interval spanning the min to the max of df1['time'] (otherwise the lookup fails for times that fall in no interval), then look up every time and reduce the possibly multiple matching intervals with groupby.last, which keeps only the last match (the most recently activated interval) for each row of df1.

This assumes df1 and df2 as inputs and requires df2 to be sorted on activatedTime.
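For reference, here is a minimal sketch of how the two inputs could be set up. The values are copied from the question's tables; the explicit sort is only there to satisfy the ordering requirement stated above, and reading the data from CSV instead is left out because the file names are not given.

```python
import pandas as pd

# Sample inputs copied from the tables in the question
df1 = pd.DataFrame({"time": [0, 1, 2, 3, 4, 5, 6],
                    "value": [10, 20, 35, 30, 40, 40, 60]})
df2 = pd.DataFrame({"time": [1, 3],
                    "value": [20, 30],
                    "activatedTime": [1, 3],
                    "deactivatedTime": [5, 4]})

# Ensure later activations come later in the frame
df2 = df2.sort_values("activatedTime", ignore_index=True)
```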

import numpy as np
import pandas as pd

idx = pd.IntervalIndex.from_arrays(np.r_[df1['time'].min(), df2['activatedTime']],
                                   np.r_[df1['time'].max(), df2['deactivatedTime']],
                                   closed='both')
intervals = pd.Series(np.r_[np.nan, df2['value']]).set_axis(idx)

s = intervals.loc[df1['time']]
# start a new group each time the left bound stops increasing,
# i.e. where the matches for the next row of df1 begin
group = s.index.left.to_series().diff().le(0).cumsum()
df1['recentActive'] = s.groupby(group.to_numpy()).last()

Output:

   time  value  recentActive
0     0     10           NaN
1     1     20          20.0
2     2     35          20.0
3     3     30          30.0
4     4     40          30.0
5     5     40          20.0
6     6     60           NaN
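To sanity-check the result, one option is a brute-force comparison using NumPy broadcasting. This is not the IntervalIndex approach above but a different, simpler technique: it builds a (rows of df1) x (rows of df2) boolean mask, which should be fine at roughly 15000 x 500 but grows quickly beyond that. The function name recent_active_bruteforce is hypothetical, and the sketch assumes df2 has at least one row and that both interval ends are inclusive, as with closed='both'.

```python
import numpy as np
import pandas as pd

# Sample inputs copied from the tables in the question
df1 = pd.DataFrame({"time": [0, 1, 2, 3, 4, 5, 6],
                    "value": [10, 20, 35, 30, 40, 40, 60]})
df2 = pd.DataFrame({"time": [1, 3], "value": [20, 30],
                    "activatedTime": [1, 3], "deactivatedTime": [5, 4]})


def recent_active_bruteforce(df1, df2):
    """Most recently activated, still-active value per row of df1 (NaN if none)."""
    d2 = df2.sort_values("activatedTime")
    t = df1["time"].to_numpy()[:, None]        # shape (n, 1)
    start = d2["activatedTime"].to_numpy()     # shape (m,)
    end = d2["deactivatedTime"].to_numpy()     # shape (m,)
    active = (start <= t) & (t <= end)         # shape (n, m), both ends inclusive
    # Column position of the last True per row: argmax over reversed columns
    m = active.shape[1]
    last = m - 1 - active[:, ::-1].argmax(axis=1)
    values = d2["value"].to_numpy(dtype=float)
    # Rows with no active interval get NaN
    return np.where(active.any(axis=1), values[last], np.nan)


result = recent_active_bruteforce(df1, df2)
```

If both approaches are run on the same inputs, np.allclose(df1['recentActive'], result, equal_nan=True) gives a quick way to compare them.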

edited Dec 13, 2022 at 16:02

answered Dec 13, 2022 at 15:49

mozway


2 Comments

mwind Over a year ago

Thanks a lot for your answer! I will test this solution and read more about some of the functions you are using here that I'm not very familiar with. I will come back with some feedback later today.

mwind Over a year ago

I just marked it as the solution. I took some time to break down and understand all the code. I'm amazed at how you thought about this. Thanks again for taking the time to help me out.
