
I have the DataFrame (DF) below:

+------+------+----+
|  Year|    PY| VAL|
+------+------+----+
|202005|201905|2005|
|202006|201906|2006|
|202007|201907|2007|
|201905|201805|1905|
|201906|201806|1906|
|201907|201807|1907|
|201805|201705|1805|
|201806|201706|1806|
|201807|201707|1807|
+------+------+----+

obtained by

val df1=Seq(
("202005","201905","2005"),
("202006","201906","2006"),
("202007","201907","2007"),
("201905","201805","1905"),
("201906","201806","1906"),
("201907","201807","1907"),
("201805","201705","1805"),
("201806","201706","1806"),
("201807","201707","1807")
).toDF("Year","PY","VAL")

I would like to populate the previous year's value (VAL_PY) in a separate column. That value resides in a different row of the same DF.

Also, I would like to achieve this in a distributed way, as my DF is large (> 10 million records).

Expected output:

+------+------+----+------+
|  Year|    PY| VAL|VAL_PY|
+------+------+----+------+
|202005|201905|2005|  1905|
|202006|201906|2006|  1906|
|202007|201907|2007|  1907|
|201905|201805|1905|  1805|
|201906|201806|1906|  1806|
|201907|201807|1907|  1807|
|201805|201705|1805|  null|
|201806|201706|1806|  null|
|201807|201707|1807|  null|
+------+------+----+------+
  • What is the logic for populating the new column values? Commented Oct 19, 2020 at 15:53
  • VAL_PY is the previous year's value, which is in the same DF but in a different row. Commented Oct 19, 2020 at 16:02
  • Yes, but how do you determine that row? Commented Oct 19, 2020 at 16:04
  • Example: in the first row we have Year = 202005 and PY = 201905, so VAL_PY = the VAL of the row where Year = 201905. Commented Oct 19, 2020 at 16:25

1 Answer

val df1 = Seq(
  ("202005","201905","2005"), ("202006","201906","2006"), ("202007","201907","2007"),
  ("201905","201805","1905"), ("201906","201806","1906"), ("201907","201807","1907"),
  ("201805","201705","1805"), ("201806","201706","1806"), ("201807","201707","1807")
).toDF("Year","PY","VAL")

// Build a lookup table: rename Year -> PY and VAL -> VAL_PY,
// so it can be joined back to df1 on the PY column.
val df2 = df1
  .drop("PY")
  .withColumnRenamed("VAL", "VAL_PY")
  .withColumnRenamed("Year", "PY")

// A left join keeps rows whose previous year is absent, with VAL_PY = null
df1.join(df2, Seq("PY"), "left")
  .select("Year", "PY", "VAL", "VAL_PY")
  .show

OUTPUT:

+------+------+----+------+
|  Year|    PY| VAL|VAL_PY|
+------+------+----+------+
|202005|201905|2005|  1905|
|202006|201906|2006|  1906|
|202007|201907|2007|  1907|
|201905|201805|1905|  1805|
|201906|201806|1906|  1806|
|201907|201807|1907|  1807|
|201805|201705|1805|  null|
|201806|201706|1806|  null|
|201807|201707|1807|  null|
+------+------+----+------+

This seemed like a left self-join to me. Please let me know if I am missing something.
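
For completeness, the same result can be obtained without the intermediate renames by aliasing the two sides of a self-join. A minimal sketch (the alias names l and r are mine, not from the question):

import org.apache.spark.sql.functions.col

// Join df1 to itself: the left side's PY must match the right side's Year.
// Aliases keep the two uses of df1 distinguishable in the condition.
df1.alias("l")
  .join(df1.alias("r"), col("l.PY") === col("r.Year"), "left")
  .select(col("l.Year"), col("l.PY"), col("l.VAL"), col("r.VAL").as("VAL_PY"))
  .show

Either way, Spark executes this as a distributed join (typically a shuffle-based sort-merge join at this scale), so it should handle the > 10 million rows mentioned in the question without collecting anything to the driver.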


Comments

But do you know why this fails: val df2 = df1.drop("PY"); df1.join(df2, df1("PY") === df2("VAL"))
This seems really interesting to me. :o
What error are you getting? Syntax-wise it's correct. Logically it won't match, as PY and VAL have no common values, so you should get an empty DF.
Sorry!! I meant this one (the earlier one was a typo): val df2 = df1.drop("PY"); df1.join(df2, df1("PY") === df2("Year"))
Error: "Use the CROSS JOIN syntax to allow cartesian products between these relations." But when I set spark.conf.set("spark.sql.crossJoin.enabled", "true"), I am getting NULL in the right DF.
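
My reading of that failure (an interpretation worth verifying against the query plan): df2 is derived from df1, so df2("Year") carries the same internal column reference as df1("Year"). After analysis the join condition effectively references only one side, Spark therefore plans a cartesian product and demands the CROSS JOIN syntax; enabling crossJoin.enabled just lets that degenerate plan run, which would explain the unexpected NULLs. A sketch of the usual workaround, disambiguating the shared lineage with aliases (the names a and b are my own):

import org.apache.spark.sql.functions.col

val df2 = df1.drop("PY")

// Aliasing gives each side of the self-join its own namespace,
// so col("a.PY") and col("b.Year") resolve to different relations.
df1.alias("a")
  .join(df2.alias("b"), col("a.PY") === col("b.Year"), "left")
  .show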