
This is my DataFrame in PySpark:

utc_timestamp               data    feed
2015-10-13 11:00:00+00:00   1       A
2015-10-13 12:00:00+00:00   5       A
2015-10-13 13:00:00+00:00   6       A
2015-10-13 14:00:00+00:00   10      B
2015-10-13 15:00:00+00:00   11      B

The values of data are cumulative.

I want to get this result (differences between consecutive rows, grouped by feed):

utc_timestamp               data    feed
2015-10-13 11:00:00+00:00   1       A
2015-10-13 12:00:00+00:00   4       A
2015-10-13 13:00:00+00:00   1       A  
2015-10-13 14:00:00+00:00   10      B
2015-10-13 15:00:00+00:00   1       B

In pandas I would do it this way:

df["data"] -= (df.groupby("feed")["data"].shift(fill_value=0))

How can I do the same thing in PySpark?
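
For completeness, a minimal sketch that reproduces this DataFrame in PySpark (assuming an existing SparkSession; the timestamps are kept as plain strings here just to mirror the sample above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question; utc_timestamp is left as a string for simplicity
df = spark.createDataFrame(
    [
        ("2015-10-13 11:00:00+00:00", 1, "A"),
        ("2015-10-13 12:00:00+00:00", 5, "A"),
        ("2015-10-13 13:00:00+00:00", 6, "A"),
        ("2015-10-13 14:00:00+00:00", 10, "B"),
        ("2015-10-13 15:00:00+00:00", 11, "B"),
    ],
    ["utc_timestamp", "data", "feed"],
)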

2 Answers


You can do this using the lag function with a window:

from pyspark.sql.window import Window
import pyspark.sql.functions as f

# partition by feed and order by timestamp so lag sees the previous row within each feed
window = Window.partitionBy("feed").orderBy("utc_timestamp")

# subtract the previous row's value; the default of 0 covers the first row in each feed
df = df.withColumn("data", f.col("data") - f.lag(f.col("data"), 1, 0).over(window))
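
To check against the expected output above, you can order the result back by feed and timestamp (a small usage sketch):

df.orderBy("feed", "utc_timestamp").show()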



You can use lag as a substitute for shift, and coalesce(..., F.lit(0)) as a substitute for fill_value=0:

from pyspark.sql.window import Window
import pyspark.sql.functions as F

window = Window.partitionBy("feed").orderBy("utc_timestamp")

data = F.col("data") - F.coalesce(F.lag(F.col("data")).over(window), F.lit(0))
df.withColumn("data", data)
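
Note that withColumn returns a new DataFrame rather than modifying df in place, so assign the result if you want to keep it:

df = df.withColumn("data", data)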

1 Comment

You can avoid coalesce by setting the default value to 0 in the lag function: lag(col, offset=1, default=0)
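
A sketch of that suggestion, reusing the window defined above:

df = df.withColumn("data", F.col("data") - F.lag("data", offset=1, default=0).over(window))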
