
I think this question is similar to some other questions, but it has not been asked before.

In Spark, how can we run a SQL query with the duplicate column removed?

For example, consider a SQL query running on Spark:

select a.*, b.*
from a
left outer join b
  on a.id = b.id

How can I remove the duplicated column b.id in this case?

I know we can use additional steps in Spark, such as providing aliases or renaming columns, but is there a faster way to remove the duplicated column simply by writing the SQL query itself?
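
For reference, here is a minimal sketch of the pure-SQL route, assuming a spark-shell session (so spark and its implicits are in scope) and a Spark version whose SQL dialect supports JOIN ... USING; the view names a and b and the sample rows are placeholders:

// Register placeholder temp views so the SQL query has something to run against.
Seq((1, "x"), (2, "y")).toDF("id", "name").createOrReplaceTempView("a")
Seq((1, 24), (3, 30)).toDF("id", "age").createOrReplaceTempView("b")

// USING (id) collapses the join key into a single column, so b.id never shows up twice.
spark.sql("select * from a left outer join b using (id)").show()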

2 Answers


I have two DataFrames, df1 and df2, and I am going to perform a join operation on the basis of the id column.

scala> val df1  = Seq((1,"mahesh"), (2,"shivangi"),(3,"manoj")).toDF("id", "name")
df1: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> df1.show
+---+--------+
| id|    name|
+---+--------+
|  1|  mahesh|
|  2|shivangi|
|  3|   manoj|
+---+--------+

scala> val df2  = Seq((1,24), (2,23),(3,24)).toDF("id", "age")
df2: org.apache.spark.sql.DataFrame = [id: int, age: int]

scala> df2.show
+---+---+
| id|age|
+---+---+
|  1| 24|
|  2| 23|
|  3| 24|
+---+---+

Here is an incorrect solution, where the join condition is defined as an expression predicate:

df1("id") === df2("id")

The problem with this approach is that the id column is duplicated in the joined DataFrame:

scala> df1.join(df2, df1("id") === df2("id"), "left").show
+---+--------+---+---+
| id|    name| id|age|
+---+--------+---+---+
|  1|  mahesh|  1| 24|
|  2|shivangi|  2| 23|
|  3|   manoj|  3| 24|
+---+--------+---+---+

The correct solution is to define the join columns as a sequence of strings, Seq("id"), instead of an expression. The joined DataFrame then contains no duplicate columns.

scala> df1.join(df2, Seq("id"),"left").show
+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|  mahesh| 24|
|  2|shivangi| 23|
|  3|   manoj| 24|
+---+--------+---+
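
As a side note, if you do need the expression form of the join (for example, when the key columns have different names on each side), here is a minimal sketch of dropping the duplicate afterwards, assuming Spark 2.0+ where drop accepts a Column so the two id columns can be disambiguated:

// Assumes the df1 and df2 defined above. drop(df2("id")) removes only the
// right-hand id column, so the result matches the Seq("id") join output.
val joined = df1.join(df2, df1("id") === df2("id"), "left").drop(df2("id"))
joined.show()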

For more information, you can refer to the Spark documentation.


Since Spark 1.4.0, you can call join in two ways: with a sequence of column names (usingColumns) or with a join expression (joinExprs). When using the first way, the join columns will only appear once in the output; a small self-join example follows the quoted source below.

/**
 * Inner equi-join with another [[DataFrame]] using the given columns.
 *
 * Different from other join functions, the join columns will only appear once in the output,
 * i.e. similar to SQL's `JOIN USING` syntax.
 *
 * {{{
 *   // Joining df1 and df2 using the columns "user_id" and "user_name"
 *   df1.join(df2, Seq("user_id", "user_name"))
 * }}}
 *
 * Note that if you perform a self-join using this function without aliasing the input
 * [[DataFrame]]s, you will NOT be able to reference any columns after the join, since
 * there is no way to disambiguate which side of the join you would like to reference.
 *
 * @param right Right side of the join operation.
 * @param usingColumns Names of the columns to join on. These columns must exist on both sides.
 * @group dfops
 * @since 1.4.0
 */

def join(right: DataFrame, usingColumns: Seq[String]): DataFrame = {
  join(right, usingColumns, "inner")
}
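
To illustrate the self-join note in the scaladoc above, here is a minimal sketch, assuming a spark-shell session (spark.implicits._ in scope); the aliases l and r and the sample data are made up for the example:

import spark.implicits._

val people = Seq((1, "mahesh"), (2, "shivangi"), (3, "manoj")).toDF("id", "name")

// Without aliases, people.join(people, Seq("id")) leaves two "name" columns that
// cannot be told apart afterwards. Aliasing each side keeps per-side references usable.
val selfJoined = people.as("l").join(people.as("r"), Seq("id"))
selfJoined.select($"l.name", $"r.name").show()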
