I would like to create a column on my Spark DataFrame from operations on two columns.

I want to create the column Areas, which is calculated with the formula:

( (Pct_Buenos_Acum[i]-Pct_Buenos_Acum[i-1]) * (Pct_Malos_Acum[i]+Pct_Malos_Acum[i-1]) ) / 2

I have tried this:

w = Window.rowsBetween(Window.unboundedPreceding, Window.currentRow)

df= df.withColumn('Areas', (( ( col('Pct_Acum_buenos')-col('Pct_Acum_buenos' ) )*(col('Pct_Acum_malos')+col('Pct_Acum_malos')))/2).over(w))

Attached is a screenshot of what I have so far.

1 Answer

In PySpark you can access the previous row's values with the F.lag window function. Going by that:

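from pyspark.sql import Window
from pyspark.sql import functions as F

# Add an index column to use in the window's ORDER BY
df = df.withColumn('index', F.monotonically_increasing_id())

# Empty partitionBy(): one window over the whole DataFrame, ordered by index
w = Window.partitionBy().orderBy('index')

# F.lag(...).over(w) is the previous row's value (null on the first row)
df = df.withColumn(
    'Areas',
    (F.col('Pct_Acum_buenos') - F.lag('Pct_Acum_buenos').over(w))
    * (F.col('Pct_Acum_malos') + F.lag('Pct_Acum_malos').over(w))
    / 2,
)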
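
As a rough cross-check of what the lag-based expression computes, here is a plain-Python sketch of the same per-row formula. The helper name `trapezoid_areas` and the sample numbers are made up for illustration and are not from the question's data. The first row has no previous row, so it gets `None`, which mirrors the null that `F.lag` yields there.

```python
from typing import List, Optional


def trapezoid_areas(buenos: List[float], malos: List[float]) -> List[Optional[float]]:
    """Per-row area: (b[i] - b[i-1]) * (m[i] + m[i-1]) / 2, None for the first row."""
    if len(buenos) != len(malos):
        raise ValueError("both columns must have the same length")
    areas: List[Optional[float]] = []
    for i in range(len(buenos)):
        if i == 0:
            # No previous row, like the null produced by F.lag on the first row
            areas.append(None)
        else:
            areas.append((buenos[i] - buenos[i - 1]) * (malos[i] + malos[i - 1]) / 2)
    return areas


# Hypothetical cumulative percentages, chosen only for illustration
pct_acum_buenos = [0.0, 0.5, 1.0]
pct_acum_malos = [0.0, 0.25, 1.0]
areas = trapezoid_areas(pct_acum_buenos, pct_acum_malos)
print(areas)  # [None, 0.0625, 0.3125]
```

If the first row should contribute an area instead of null, one option is to pass a default to lag, e.g. F.lag('Pct_Acum_buenos', 1, 0.0), assuming a starting value of 0 makes sense for your data.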