Derived column in pySpark using two columns and previous row's value

Question

I would like to add a column to my Spark DataFrame that is computed from two existing columns and their values in the previous row.

I want to create the column Areas which is calculated with the formula:

( (Pct_Buenos_Acum[i]-Pct_Buenos_Acum[i-1]) * (Pct_Malos_Acum[i]+Pct_Malos_Acum[i-1]) ) / 2
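To make the formula concrete, here is a minimal plain-Python sketch of the same per-row calculation. The sample values and the helper name `trapezoid_areas` are made up for illustration; they are not from the question's data. Row 0 has no previous row, so the sketch leaves its area undefined (`None`).

```python
# Hypothetical cumulative percentages, for illustration only.
pct_buenos_acum = [0.0, 0.5, 1.0]
pct_malos_acum = [0.0, 0.5, 1.0]


def trapezoid_areas(x, y):
    """Apply ((x[i] - x[i-1]) * (y[i] + y[i-1])) / 2 to each row i >= 1."""
    areas = [None]  # row 0 has no previous row
    for i in range(1, len(x)):
        areas.append((x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2)
    return areas


areas = trapezoid_areas(pct_buenos_acum, pct_malos_acum)
print(areas)
```

Each value is the area of one trapezoid between consecutive points, so summing the non-null entries gives the total area under the curve.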

I have tried this:

w = Window.rowsBetween(Window.unboundedPreceding, Window.currentRow)

df= df.withColumn('Areas', (( ( col('Pct_Acum_buenos')-col('Pct_Acum_buenos' ) )*(col('Pct_Acum_malos')+col('Pct_Acum_malos')))/2).over(w))

Find attached a print of what I have so far

Rahul Chawla · Accepted Answer · 2018-12-20 10:36:01Z


Here is a way to access previous values in PySpark: apply the lag function over an ordered window. Going by that:

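from pyspark.sql import functions as F
from pyspark.sql.window import Window

# add an index column to order by (IDs are increasing and unique, but not consecutive)
df = df.withColumn('index', F.monotonically_increasing_id())

# a window with no partitioning, ordered by the index
w = Window.partitionBy().orderBy('index')

# F.lag(...).over(w) returns the previous row's value (null on the first row)
df = df.withColumn('Areas', ((F.col('Pct_Acum_buenos') - F.lag(F.col('Pct_Acum_buenos')).over(w)) * (F.col('Pct_Acum_malos') + F.lag(F.col('Pct_Acum_malos')).over(w))) / 2)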

edited Dec 20, 2018 at 10:36

answered Dec 20, 2018 at 8:31

