I have a dataframe that looks like this (link to csv):

time  ,  value
 0    ,   10
 1    ,   20
 2    ,   35
 3    ,   30
 4    ,   40
 5    ,   40
 6    ,   60

And I want to fill another column recentActive based on the values from this smaller dataframe (link to csv):

time  ,  value , activatedTime , deactivatedTime
 1    ,   20   ,      1        ,       5
 3    ,   30   ,      3        ,       4

In the recentActive column we should have the most recently activated value that has not been deactivated yet. Once a value is deactivated, we should fill the column with the previous value that is still active. The final dataframe should look like this:

time  ,  value  ,  recentActive
 0    ,   10    ,      NaN
 1    ,   20    ,      20   (t=1 activated)
 2    ,   35    ,      20
 3    ,   30    ,      30   (t=3 activated)
 4    ,   40    ,      30   (t=3 deactivated)
 5    ,   40    ,      20   (t=1 deactivated)
 6    ,   60    ,      NaN  (no active values)

How can I do this? Preferably just using vectorized operations, thanks!

  • How big is each dataframe in real life? Commented Dec 13, 2022 at 8:50
  • The bigger one will have around 15000 lines and the smaller one around 500. Commented Dec 13, 2022 at 10:23
  • @mozway any suggestion on how to do this? Commented Dec 13, 2022 at 14:59
  • See the suggestion below. It might not be bullet-proof, so don't hesitate to give feedback with an example if you find cases where it doesn't work. Commented Dec 13, 2022 at 15:54

1 Answer

It's a bit complex to achieve if you want a performant solution.

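You can build an IntervalIndex from the activation/deactivation times, plus a "catch-all" interval spanning the minimum to the maximum of df1['time'] (without it, the slicing fails for times that fall in no interval). Then slice with df1['time'], which can return several matching intervals per time, and aggregate those matches with groupby.last to keep only the most recently activated value for each row of df1.

This assumes df1 and df2 as inputs, requires df2 to be sorted on activatedTime, and relies on df1 having a default 0..n-1 index.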
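For reference, here is a minimal sketch of how the two inputs could be built from the sample data in the question. The explicit `sort_values` and `reset_index` calls are my assumption about how to satisfy the preconditions (df2 sorted on activatedTime, df1 with a default index); they are no-ops on this sample.

```python
import pandas as pd

# Larger frame from the question
df1 = pd.DataFrame({
    "time": [0, 1, 2, 3, 4, 5, 6],
    "value": [10, 20, 35, 30, 40, 40, 60],
})

# Smaller frame with the activation windows
df2 = pd.DataFrame({
    "time": [1, 3],
    "value": [20, 30],
    "activatedTime": [1, 3],
    "deactivatedTime": [5, 4],
})

# Preconditions of the approach: df2 ordered by activation time,
# df1 carrying a plain 0..n-1 index
df2 = df2.sort_values("activatedTime", ignore_index=True)
df1 = df1.reset_index(drop=True)
```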

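import numpy as np
import pandas as pd

# one closed interval per row of df2, preceded by a catch-all interval over all of df1['time']
idx = pd.IntervalIndex.from_arrays(np.r_[df1['time'].min(), df2['activatedTime']],
                                   np.r_[df1['time'].max(), df2['deactivatedTime']],
                                   closed='both')
# the catch-all interval maps to NaN, the others to their value
intervals = pd.Series(np.r_[np.nan, df2['value']]).set_axis(idx)

# one row per (time, matching interval) pair
s = intervals.loc[df1['time']]
# start a new group whenever the left bound stops increasing, i.e. at each new time
group = s.index.left.to_series().diff().le(0).cumsum()
# last non-NaN value per group; groups are numbered 0..n-1 like df1's default index
df1['recentActive'] = s.groupby(group.to_numpy()).last()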

Output:

   time  value  recentActive
0     0     10           NaN
1     1     20          20.0
2     2     35          20.0
3     3     30          30.0
4     4     40          30.0
5     5     40          20.0
6     6     60           NaN
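Since the approach may not cover every edge case, it can help to compare it against a slow but straightforward reference. The following is only a sketch: the function name `recent_active_reference` is hypothetical, and it assumes that activation windows are closed on both ends (matching `closed='both'` above) and that "most recent" means the largest activatedTime among the windows containing a given time.

```python
import numpy as np
import pandas as pd


def recent_active_reference(df1, df2):
    """Loop-based reference (hypothetical helper): for each time, return the
    value of the window with the latest activatedTime that still contains it."""
    out = []
    for t in df1["time"]:
        # windows are treated as closed on both ends
        active = df2[(df2["activatedTime"] <= t) & (t <= df2["deactivatedTime"])]
        if active.empty:
            out.append(np.nan)
        else:
            out.append(active.loc[active["activatedTime"].idxmax(), "value"])
    return pd.Series(out, index=df1.index, dtype="float64")


# Sample data from the question
df1 = pd.DataFrame({"time": [0, 1, 2, 3, 4, 5, 6],
                    "value": [10, 20, 35, 30, 40, 40, 60]})
df2 = pd.DataFrame({"time": [1, 3], "value": [20, 30],
                    "activatedTime": [1, 3], "deactivatedTime": [5, 4]})

result = recent_active_reference(df1, df2)
print(result.tolist())  # [nan, 20.0, 20.0, 30.0, 30.0, 20.0, nan]
```

The loop scans df2 once per row of df1, so it is meant for checking results on small samples rather than as a replacement for the vectorized version.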

2 Comments

Thanks a lot for your answer! I will test this solution and read more about some of the functions you are using here that I'm not very familiar with. I will come back with some feedback later today.
I just marked it as the solution. I took some time to break down and understand all the code. I'm amazed at how you thought about this, thanks again for taking the time to help me out.
