
I have a dataframe a:

id,value
1,11
2,22
3,33

And another dataframe b:

id,value
1,123
3,345

I want to update dataframe a with all matching values from b (based on column 'id').

Final dataframe 'c' would be:

id,value
1,123
2,22
3,345

How can I achieve that using dataframe joins (or another approach)?

Tried:

a.join(b, a.id == b.id, "inner").drop(a.value)

Gives (not the desired output):

+---+---+-----+
| id| id|value|
+---+---+-----+
|  1|  1|  123|
|  3|  3|  345|
+---+---+-----+

Thanks.

  • It will cost you a bit, but it will get you the result: scala> dfd.join(df.select("id"), Seq("id"), "inner").union(df.join(dfd, Seq("id"), "left_anti")).orderBy("id").show – Commented Oct 14, 2019 at 16:12
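If you prefer that route in PySpark, a minimal sketch of the same idea (assuming a and b are the dataframes from the question) might look like this:

# Rows of b whose id exists in a become the updated rows; the left_anti join
# brings back the rows of a that have no match in b.
c = (
    b.join(a.select("id"), on="id", how="inner")   # updated rows: id 1 and 3
     .union(a.join(b, on="id", how="left_anti"))   # unchanged row: id 2
     .orderBy("id")
)
c.show()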

1 Answer


I don't think there is built-in update functionality, but this should work:

import pyspark.sql.functions as F

# Keep df1's value when there is no match in df2, otherwise take df2's value.
df1.join(df2, df1.id == df2.id, "left_outer") \
   .select(df1.id,
           F.when(df2.value.isNull(), df1.value).otherwise(df2.value).alias("value"))
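For reference, here is a self-contained sketch against the sample data from the question (the SparkSession setup and variable names are illustrative, not part of the original answer):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, 11), (2, 22), (3, 33)], ["id", "value"])  # dataframe a
df2 = spark.createDataFrame([(1, 123), (3, 345)], ["id", "value"])         # dataframe b

c = df1.join(df2, df1.id == df2.id, "left_outer") \
       .select(df1.id,
               F.when(df2.value.isNull(), df1.value).otherwise(df2.value).alias("value"))

c.orderBy("id").show()

which gives the desired output:

+---+-----+
| id|value|
+---+-----+
|  1|  123|
|  2|   22|
|  3|  345|
+---+-----+

F.coalesce(df2.value, df1.value) is an equivalent, slightly shorter way to express the when/otherwise.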

2 Comments

This is close, judging by the logic, but I'm getting an error: TypeError("Column is not iterable").
Replaced column() with select().
