I have two DataFrames, df1 and df2, and I am going to join them on the id column.
scala> val df1 = Seq((1,"mahesh"), (2,"shivangi"),(3,"manoj")).toDF("id", "name")
df1: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> df1.show
+---+--------+
| id| name|
+---+--------+
| 1| mahesh|
| 2|shivangi|
| 3| manoj|
+---+--------+
scala> val df2 = Seq((1,24), (2,23),(3,24)).toDF("id", "age")
df2: org.apache.spark.sql.DataFrame = [id: int, age: int]
scala> df2.show
+---+---+
| id|age|
+---+---+
| 1| 24|
| 2| 23|
| 3| 24|
+---+---+
Here is the problematic approach, where the join condition is given as an expression comparing the two id columns:
df1("id") === df2("id")
The drawback is that the id column appears twice in the joined DataFrame:
scala> df1.join(df2, df1("id") === df2("id"), "left").show
+---+--------+---+---+
| id| name| id|age|
+---+--------+---+---+
| 1| mahesh| 1| 24|
| 2|shivangi| 2| 23|
| 3| manoj| 3| 24|
+---+--------+---+---+
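If you do need the expression form (for example, when the key columns have different names), you can remove the duplicate afterwards with `drop`, which accepts a `Column` so you can target df2's copy specifically. A minimal sketch, assuming the df1 and df2 frames defined above and the Spark 2.x `drop(col: Column)` API:

```scala
// Expression join followed by dropping df2's copy of the key column.
// Assumes df1 and df2 from the example above.
val joined = df1
  .join(df2, df1("id") === df2("id"), "left")
  .drop(df2("id"))  // removes only df2's id, keeping df1's

joined.show()
```

Note that `drop("id")` (the String overload) would not help here: with two columns named id, it would be ambiguous which copy to remove, so passing the `Column` reference is the reliable form.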
The correct solution is to pass the join columns as a sequence of column names, Seq("id"), instead of an expression. The joined DataFrame then contains a single id column:
scala> df1.join(df2, Seq("id"), "left").show
+---+--------+---+
| id| name|age|
+---+--------+---+
| 1| mahesh| 24|
| 2|shivangi| 23|
| 3| manoj| 24|
+---+--------+---+
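The Seq-of-names form corresponds to SQL's USING clause, which likewise emits the join key only once. A sketch of the equivalent SQL, assuming a SparkSession bound to the name spark (as in spark-shell):

```scala
// Register the frames as temp views and join with USING,
// which deduplicates the id column just like Seq("id").
df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")
spark.sql("SELECT * FROM t1 LEFT JOIN t2 USING (id)").show()
```

The same Seq form also scales to composite keys, e.g. Seq("id", "date") when both frames share those column names.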