
I have the DataFrame (DF) below:

+------+------+----+
|  Year|    PY| VAL|
+------+------+----+
|202005|201905|2005|
|202006|201906|2006|
|202007|201907|2007|
|201905|201805|1905|
|201906|201806|1906|
|201907|201807|1907|
|201805|201705|1805|
|201806|201706|1806|
|201807|201707|1807|
+------+------+----+

obtained by

val df1=Seq(
("202005","201905","2005"),
("202006","201906","2006"),
("202007","201907","2007"),
("201905","201805","1905"),
("201906","201806","1906"),
("201907","201807","1907"),
("201805","201705","1805"),
("201806","201706","1806"),
("201807","201707","1807")
).toDF("Year","PY","VAL")

I would like to populate the previous year's value (VAL_PY) in a separate column. That value resides in a different row of the same DF.

Also, I would like to achieve this in a distributed way, as my DF is large (> 10 million records).

Expected output:

+------+------+----+------+
|  Year|    PY| VAL|VAL_PY|
+------+------+----+------+
|202005|201905|2005|  1905|
|202006|201906|2006|  1906|
|202007|201907|2007|  1907|
|201905|201805|1905|  1805|
|201906|201806|1906|  1806|
|201907|201807|1907|  1807|
|201805|201705|1805|  null|
|201806|201706|1806|  null|
|201807|201707|1807|  null|
+------+------+----+------+
  • What is the logic for populating the new column values? Commented Oct 19, 2020 at 15:53
  • VAL_PY is the previous year's value, which is in the same DF but in a different row. Commented Oct 19, 2020 at 16:02
  • Yes, but how do you determine that row? Commented Oct 19, 2020 at 16:04
  • Example: in the first row we have Year = 202005 and PY = 201905, so VAL_PY = the VAL of the row where Year = 201905. Commented Oct 19, 2020 at 16:25

1 Answer

val df1 = Seq(
  ("202005","201905","2005"), ("202006","201906","2006"), ("202007","201907","2007"),
  ("201905","201805","1905"), ("201906","201806","1906"), ("201907","201807","1907"),
  ("201805","201705","1805"), ("201806","201706","1806"), ("201807","201707","1807")
).toDF("Year","PY","VAL")

// Build a lookup table: rename Year -> PY and VAL -> VAL_PY,
// so it can be joined back to df1 on the PY column.
val df2 = df1
  .drop("PY")
  .withColumnRenamed("VAL", "VAL_PY")
  .withColumnRenamed("Year", "PY")

// A left join keeps rows whose previous year is absent, with VAL_PY = null
df1.join(df2, Seq("PY"), "left")
  .select("Year", "PY", "VAL", "VAL_PY")
  .show

OUTPUT:

+------+------+----+------+
|  Year|    PY| VAL|VAL_PY|
+------+------+----+------+
|202005|201905|2005|  1905|
|202006|201906|2006|  1906|
|202007|201907|2007|  1907|
|201905|201805|1905|  1805|
|201906|201806|1906|  1806|
|201907|201807|1907|  1807|
|201805|201705|1805|  null|
|201806|201706|1806|  null|
|201807|201707|1807|  null|
+------+------+----+------+

This seemed like a left self-join to me. Please let me know if I am missing something.
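
For completeness, the same result can be obtained without the intermediate renames by aliasing the two sides of a self-join. A minimal sketch (the alias names l and r are mine, not from the question):

import org.apache.spark.sql.functions.col

// Join df1 to itself: the left side's PY must match the right side's Year.
// Aliases keep the two uses of df1 distinguishable in the condition.
df1.alias("l")
  .join(df1.alias("r"), col("l.PY") === col("r.Year"), "left")
  .select(col("l.Year"), col("l.PY"), col("l.VAL"), col("r.VAL").as("VAL_PY"))
  .show

Either way, Spark executes this as a distributed join (typically a shuffle-based sort-merge join at this scale), so it should handle the > 10 million rows mentioned in the question without collecting anything to the driver.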


Comments

But do you know why this fails: val df2 = df1.drop("PY"); df1.join(df2, df1("PY") === df2("VAL"))
This seems really interesting to me. :o
What error are you getting? Syntax-wise it's correct. Logically it won't match, as PY and VAL have no common values, so you should get an empty DF.
Sorry!! I meant this one (the earlier one was a typo): val df2 = df1.drop("PY"); df1.join(df2, df1("PY") === df2("Year"))
Error: "Use the CROSS JOIN syntax to allow cartesian products between these relations." But when I set spark.conf.set("spark.sql.crossJoin.enabled", "true"), I am getting NULL in the right DF.
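
My reading of that failure (an interpretation worth verifying against the query plan): df2 is derived from df1, so df2("Year") carries the same internal column reference as df1("Year"). After analysis the join condition effectively references only one side, Spark therefore plans a cartesian product and demands the CROSS JOIN syntax; enabling crossJoin.enabled just lets that degenerate plan run, which would explain the unexpected NULLs. A sketch of the usual workaround, disambiguating the shared lineage with aliases (the names a and b are my own):

import org.apache.spark.sql.functions.col

val df2 = df1.drop("PY")

// Aliasing gives each side of the self-join its own namespace,
// so col("a.PY") and col("b.Year") resolve to different relations.
df1.alias("a")
  .join(df2.alias("b"), col("a.PY") === col("b.Year"), "left")
  .show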