
I have a dataframe a:

id,value
1,11
2,22
3,33

And another dataframe b:

id,value
1,123
3,345

I want to update dataframe a with all matching values from b (based on column 'id').

Final dataframe 'c' would be:

id,value
1,123
2,22
3,345

How can I achieve that using dataframe joins (or another approach)?

Tried:

a.join(b, a.id == b.id, "inner").drop(a.value)

Gives (not the desired output):

+---+---+-----+
| id| id|value|
+---+---+-----+
|  1|  1|  123|
|  3|  3|  345|
+---+---+-----+

Thanks.

  • It will cost you a bit, but it will get you the result: scala> dfd.join(df.select("id"), Seq("id"), "inner").union(df.join(dfd, Seq("id"), "left_anti")).orderBy("id").show – Commented Oct 14, 2019 at 16:12
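If you prefer that route in PySpark, a minimal sketch of the same idea (assuming a and b are the dataframes from the question) might look like this:

# Rows of b whose id exists in a become the updated rows; the left_anti join
# brings back the rows of a that have no match in b.
c = (
    b.join(a.select("id"), on="id", how="inner")   # updated rows: id 1 and 3
     .union(a.join(b, on="id", how="left_anti"))   # unchanged row: id 2
     .orderBy("id")
)
c.show()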

1 Answer


I don't think there is built-in update functionality, but this should work:

import pyspark.sql.functions as F

# Keep df1's value when there is no match in df2, otherwise take df2's value.
df1.join(df2, df1.id == df2.id, "left_outer") \
   .select(df1.id,
           F.when(df2.value.isNull(), df1.value).otherwise(df2.value).alias("value"))
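For reference, here is a self-contained sketch against the sample data from the question (the SparkSession setup and variable names are illustrative, not part of the original answer):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, 11), (2, 22), (3, 33)], ["id", "value"])  # dataframe a
df2 = spark.createDataFrame([(1, 123), (3, 345)], ["id", "value"])         # dataframe b

c = df1.join(df2, df1.id == df2.id, "left_outer") \
       .select(df1.id,
               F.when(df2.value.isNull(), df1.value).otherwise(df2.value).alias("value"))

c.orderBy("id").show()

which gives the desired output:

+---+-----+
| id|value|
+---+-----+
|  1|  123|
|  2|   22|
|  3|  345|
+---+-----+

F.coalesce(df2.value, df1.value) is an equivalent, slightly shorter way to express the when/otherwise.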

2 Comments

This is close, judging by the logic, but I'm getting an error: TypeError("Column is not iterable").
Replaced column() with select().
