Calculating column value based on previous row and column using lambda function

Question

I have a pandas DataFrame that looks like this:

index up_walk    down_walk   up_avg  down_avg
  0   0.000000   17.827148  0.36642   9.06815
  1   1.550781    0.000000      NaN       NaN
  2   0.957031    0.000000      NaN       NaN
  3   0.000000    2.878906      NaN       NaN

I want to fill in the missing values, which are currently NaN, using this formula:

df['up_avg'][i] = df['up_avg'][i-1] * 12 + df['up_walk'][i]

Explanation: for every row with a missing value, I want to compute the value from the previous row of the same column, plus the current row's value from a different column, and continue this calculation to the end of the dataframe. Each new row therefore depends on the up_avg value calculated for the previous row.

The problem is that using a loop is very slow because the dataframe is large (10K rows).

Can anyone please help implement a lambda function for this?

If this is not possible, can anyone share a script for an efficient loop?

I tried a lot of things with no success, such as this:

df['up_avg'] = df.apply(lambda x: pd.Series(np.where((x.up_avg != None), x.up_avg.shift() * 12 + x.up_walk, x.up_avg)))

This raised an error: "AttributeError: 'Series' object has no attribute 'up_avg'"

I also tried using shift to create new columns and then applying a lambda function, again with no success.

I expect that my dataframe will look like this at the end:

index up_walk    down_walk   up_avg      down_avg
  0   0.000000   17.827148   0.36642     9.06815
  1   1.550781    0.000000   5.947821    108.8178
  2   0.957031    0.000000   72.330883   1305.8136
  3   0.000000    2.878906   867.970596  15672.642106

Thanks a lot!

Bushmaster · Accepted Answer · 2022-11-12 04:37:44Z


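You can use np.roll instead of shift. Also, if you use apply on rows, you must specify the axis (axis=1); with the default axis the lambda receives whole columns, which is what raises the AttributeError. Here the NaN values are filled in a loop instead: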

import numpy as np

# Each pass fills the first NaN of a NaN run from the row above it,
# so keep going until there is no NaN value left.
# np.roll wraps the last row around to row 0, so this assumes row 0 already holds a value.
for avg, walk in (('up_avg', 'up_walk'), ('down_avg', 'down_walk')):
    while df[avg].isna().any():
        df[avg] = np.where(df[avg].isna(), np.roll(df[avg], 1) * 12 + df[walk], df[avg])

print(df)
    up_walk   down_walk     up_avg      down_avg
0   0.0       17.827148     0.36642     9.06815
1   1.550781  0.0           5.947821    108.81779999999999
2   0.957031  0.0           72.330883   1305.8136
3   0.0       2.878906      867.970596  15672.642106
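
On the apply point above, here is a minimal sketch of the difference between the default axis and axis=1, using a cut-down copy of the question's data. The variable names `column_wise_error` and `row_wise` are only illustrative. Note that even with axis=1 the lambda sees only the original row, so it cannot chain from a value computed for the previous row, which is why a loop is used here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "up_walk": [0.0, 1.550781, 0.957031, 0.0],
    "up_avg": [0.36642, np.nan, np.nan, np.nan],
})

# Default axis=0: the lambda receives each *column* as a Series,
# so looking up a column name as an attribute fails.
try:
    df.apply(lambda x: x.up_avg)
    column_wise_error = None
except AttributeError as exc:
    column_wise_error = exc

# axis=1: the lambda receives each *row*, so x.up_avg is that row's value.
# Rows whose up_avg is NaN still produce NaN; nothing is carried forward.
row_wise = df.apply(lambda x: x.up_avg * 12 + x.up_walk, axis=1)

print(repr(column_wise_error))
print(row_wise)
```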


edited Nov 12, 2022 at 4:37

answered Nov 11, 2022 at 22:11


Vin · Accepted Answer · 2022-11-11 23:30:38Z

Based on the math you're trying to implement here, each missing (NaN) value is given by:

up_avg1 = 12*up_avg0 + up_walk1
up_avg2 = 12*up_avg1 + up_walk2
up_avg3 = 12*up_avg2 + up_walk3

...and so on. Expressed in this way, each new value of up_avg depends on the previous value of up_avg, which forces you to loop.
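
For reference, the looped form of this recurrence can be sketched as a single pass over plain NumPy arrays, which avoids per-row pandas indexing. This is only a sketch under assumptions: `fill_recurrence` is a hypothetical helper name, the first row is assumed to hold a known value, and the data is a copy of the four rows from the question.

```python
import numpy as np
import pandas as pd


def fill_recurrence(avg, walk, constant=12.0):
    """Fill NaNs in `avg` using avg[i] = constant * avg[i-1] + walk[i]."""
    out = np.asarray(avg, dtype=float).copy()
    walk = np.asarray(walk, dtype=float)
    # One pass, top to bottom: each filled value feeds the next row.
    for i in range(1, len(out)):
        if np.isnan(out[i]):
            out[i] = constant * out[i - 1] + walk[i]
    return out


df = pd.DataFrame({
    "up_walk": [0.0, 1.550781, 0.957031, 0.0],
    "down_walk": [17.827148, 0.0, 0.0, 2.878906],
    "up_avg": [0.36642, np.nan, np.nan, np.nan],
    "down_avg": [9.06815, np.nan, np.nan, np.nan],
})

df["up_avg"] = fill_recurrence(df["up_avg"], df["up_walk"])
df["down_avg"] = fill_recurrence(df["down_avg"], df["down_walk"])
print(df)
```

Working on arrays and writing the column back once keeps the loop body to simple scalar arithmetic, which is usually where a slow row-by-row DataFrame loop loses its time.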

Unpacking this, we recognise that:

up_avg1 = (12**1)*up_avg0 + (12**0)*up_walk1
up_avg2 = (12**2)*up_avg0 + (12**1)*up_walk1 + (12**0)*up_walk2 
up_avg3 = (12**3)*up_avg0 + (12**2)*up_walk1 + (12**1)*up_walk2  + (12**0)*up_walk3

...and so on. This allows us to express all your unknown values of up_avg (your nan values) as the product of some calculation relying on three things, all of which you know at the outset:

Your constant (12)
Your first known value of up_avg (up_avg0 = 0.36642)
All your known values of up_walk (up_walk1, up_walk2, up_walk3, etc.)

Like this:

[up_avg1]             [12  ]     [1    0    0]   [up_walk1]
[up_avg2] = up_avg0 * [144 ]  +  [12   1    0] * [up_walk2]  
[up_avg3]             [1728]     [144  12   1]   [up_walk3]
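
Written out for a general row n, the same pattern can be stated as one formula. This is just a restatement of the expansion above in the same names, with the constant 12 from the question:

```latex
\mathrm{up\_avg}_n \;=\; 12^{\,n}\,\mathrm{up\_avg}_0 \;+\; \sum_{k=1}^{n} 12^{\,n-k}\,\mathrm{up\_walk}_k
```

For n = 2 this gives 144*up_avg0 + 12*up_walk1 + up_walk2, matching the second row of the matrix form.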

Therefore, instead of looping and calculating each nan value one by one, you can express this math as (basically) a single step matrix algebra problem and solve it that way - kinda like this.

Fair warning - I'm not an expert in numpy or pandas - so the implementation here might be clumsy - the rationalisation of the math is what I'm trying to get across.

import numpy as np
import pandas as pd
from scipy.linalg import toeplitz
   
num_rows_in_dataframe = len(df)
# sets num_rows_in_dataframe to 4 (len(df) also works when "index" is the DataFrame index, not a column)
constant = 12
# this is the value you want to multiply by each "previous" up_avg value
up_avg_1 = df['up_avg'].iloc[0]
# this is the first up_avg value you have = 0.36642

toeplitz_c = np.arange(num_rows_in_dataframe-1)
toeplitz_r = np.hstack((np.array([1]), np.zeros((num_rows_in_dataframe-2))))
powers = toeplitz(toeplitz_c, toeplitz_r)
# these three lines basically construct this matrix:
    # [0., 0., 0.]
    # [1., 0., 0.]
    # [2., 1., 0.]
    
# We then raise your constant to the powers in this matrix:
constant_array = constant**powers
# which gives us:
    # [  1.,   1.,   1.]
    # [ 12.,   1.,   1.]
    # [144.,  12.,   1.]

# We then take the bottom triangle of this matrix:
constant_array = np.tril(constant_array)

# Giving us this matrix:
    # [  1.,   0.,   0.]
    # [ 12.,   1.,   0.]
    # [144.,  12.,   1.]
# We pick up all the "same row" values of up_walk and place them in a vector:
up_walk = np.array(df['up_walk'])[1:][:, np.newaxis]
    # [1.550781]
    # [0.957031]
    # [0.      ]

# And finally, putting it all together:
replace_nans_with = np.matmul(constant_array, up_walk) + (up_avg_1*constant**np.arange(1, num_rows_in_dataframe))[:, np.newaxis]

# We get an array of your missing nans:
    # [  5.947821]
    # [ 72.330883]
    # [867.970596]

Finally put this vector of values in place of your nan values and you're home.
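
A minimal sketch of that last step, assuming (as in the question) that only the first row of up_avg is known and every later row is NaN. The array below is a stand-in for the `replace_nans_with` result computed above, so that the snippet runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "up_walk": [0.0, 1.550781, 0.957031, 0.0],
    "up_avg": [0.36642, np.nan, np.nan, np.nan],
})

# Stand-in for the (n-1, 1) column vector produced by the matrix calculation.
replace_nans_with = np.array([[5.947821], [72.330883], [867.970596]])

# Flatten to 1-D and assign by position to every row after the first.
df.iloc[1:, df.columns.get_loc("up_avg")] = replace_nans_with.ravel()
print(df)
```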

This isn't really the "efficient loop" you were after, but (unless I'm missing something) it is a "non-loopy" way to solve the problem and should help you do the job faster - I'd be keen to know if running it this way does in fact prove faster - please try it and let me know.

Wow, amazing approach to this problem. I tried it, and yes, it runs faster than a loop. The only problem is that you need to use a small constant when calculating large dataframes.
True, the constant powers will become very large very fast. You could only get around it by breaking your dataset into big chunks. Anyway, thanks for trying it!
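
To put a rough number on the point in these comments, the sketch below checks how far powers of 12 can go before float64 overflows to infinity. The range of 400 exponents is an arbitrary choice for illustration. Two related caveats, stated as assumptions rather than measurements: the power matrix has (n-1) x (n-1) entries, so about 10K rows would need on the order of 10^8 floats; and with non-negative walk values the up_avg values themselves grow roughly like 12^n, so a plain loop would reach the same float64 limit.

```python
import numpy as np

constant = 12.0
exponents = np.arange(0, 400)

# Overflowing float64 powers become inf; silence the overflow warning for this check.
with np.errstate(over="ignore"):
    powers = constant ** exponents

# Largest exponent whose power of 12 is still a finite float64.
largest_finite_exponent = int(exponents[np.isfinite(powers)].max())
print(largest_finite_exponent)
```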
