I am getting many duplicated columns after joining two DataFrames, and I want to drop the columns that come last. Below is my printSchema output:

root
 |-- id: string (nullable = true)
 |-- value: string (nullable = true)
 |-- test: string (nullable = true)
 |-- details: string (nullable = true)
 |-- test: string (nullable = true)
 |-- value: string (nullable = true)

Now I want to drop the last two columns:

 |-- test: string (nullable = true)
 |-- value: string (nullable = true)

I tried df.dropDuplicates(), but it did not help (it deduplicates rows, not columns).

How do I drop the duplicated columns that come last?

4 Answers

You have to use varargs syntax to expand the column names from an array into drop. Check below:

scala> dfx.show
+---+---+---+---+------------+------+
|  A|  B|  C|  D|         arr|mincol|
+---+---+---+---+------------+------+
|  1|  2|  3|  4|[1, 2, 3, 4]|     A|
|  5|  4|  3|  1|[5, 4, 3, 1]|     D|
+---+---+---+---+------------+------+

scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)

scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)

scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
|  1|  2|  3|  4|
|  5|  4|  3|  1|
+---+---+---+---+


scala>

Update1:

scala>  val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
|  A|  B|  C|  D|  B|  C|
+---+---+---+---+---+---+
|  1|  2|  3|  4|  2|  3|
|  5|  4|  3|  1|  4|  3|
+---+---+---+---+---+---+


scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
|  1|  2|  3|  4|
|  5|  4|  3|  1|
+---+---+---+---+


scala>

Update2:

To remove the columns dynamically, check the solution below.

scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]

scala> df3.show
+---+---+---+---+---+---+
|  A|  B|  C|  D|  B|  C|
+---+---+---+---+---+---+
|  1|  2|  3|  4|  9|  9|
|  5|  4|  3|  1|  8|  8|
+---+---+---+---+---+---+

scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)

scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)

scala> import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrame

scala> val df4 = rem2.foldLeft(df3) { (acc: DataFrame, colName: String) => acc.drop(col(colName)) }
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala>  df4.show
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
|  1|  2|  3|  4|
|  5|  4|  3|  1|
+---+---+---+---+


scala>

Update3:

Renaming/aliasing in one go.

scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val dfa2 = dfa.columns.foldLeft(dfa) { (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")}
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]

scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
|  1|  2|  3|  4|
|  5|  4|  3|  1|
+---+---+---+---+


scala>

5 Comments

This solution drops both of the duplicated columns; my requirement is to drop one and keep the other.
In that case, you can alias the table names and chain drop() calls on the resulting df. Check my update.
Check my Update2 on how to approach it dynamically. Let me know if it helps.
Is it possible to alias a list of columns in one go?
Yes, with the foldLeft technique you can rename the columns in one go. Check my Update3.

  1. df.dropDuplicates() works only on rows.
  2. You can drop one side's copy of a duplicated column by reference: df1.drop(df2.col("value")) (see the sketch below).
  3. You can specify the columns you want to keep, for example with df.select(...).
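
A minimal sketch of points 2 and 3, assuming the question's two DataFrames share an "id" join key and both carry "value" and "test" columns (the data here is illustrative; in spark-shell, toDF comes from spark.implicits._):

import org.apache.spark.sql.functions.col

val df1 = Seq(("1", "v1", "t1", "d1")).toDF("id", "value", "test", "details")
val df2 = Seq(("1", "v2", "t2")).toDF("id", "value", "test")

val joined = df1.join(df2, Seq("id"))

// point 2: drop df2's copy of each duplicated column by reference
val viaDrop = joined.drop(df2.col("value")).drop(df2.col("test"))

// point 3: select only the columns to keep, disambiguating the
// duplicated names through the DataFrame they came from
val viaSelect = joined.select(col("id"), df1.col("value"), df1.col("test"), df1.col("details"))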

3 Comments

df.select gives an ambiguity error when two duplicated columns are present; my requirement is to drop one and keep the other of the duplicated pair.
Try df2.column("value") instead of new Column("value")
Getting an error: df2.drop(df1.column("id")) gives <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame
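
That error is expected: DataFrame has no column method; its column accessors are col and apply. A tiny sketch of the two forms that do compile:

val byCol   = df2.col("value")   // explicit accessor
val byApply = df2("value")       // the same Column via apply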

Suppose you have two DataFrames, DF1 and DF2. You can use either of these ways to join on particular columns:

 1. DF1.join(DF2,Seq("column1","column2"))
 2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))

So to drop the duplicate columns you can use

 1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1"),DF1("column2"))
 2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")).drop(DF1("column1"),DF1("column2"))

In either case you can use drop("columnname") to drop whatever columns you need; it doesn't matter which DataFrame they come from, since they are equal in this case.
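
A short runnable sketch of the difference between the two join styles (illustrative data; in spark-shell, toDF comes from spark.implicits._):

val DF1 = Seq((1, "a"), (2, "b")).toDF("column1", "x")
val DF2 = Seq((1, "c"), (2, "d")).toDF("column1", "y")

// style 1: the using-columns join keeps a single copy of column1
DF1.join(DF2, Seq("column1")).columns
// Array(column1, x, y)

// style 2: the expression join keeps both copies, so drop one side's
DF1.join(DF2, DF1("column1") === DF2("column1")).drop(DF2("column1")).columns
// Array(column1, x, y)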

5 Comments

The second approach worked, but I have the list of columns to be dropped in a val: List("column1", "column2", "columnn"). How do I pass this list to drop(DF1("column1"), DF1("column2"), ...... DF1("columnn"))?
What do you mean by a list of columns? Are they dynamically generated?
val clmlist = List("column1", "column2", "columnn"); df1.join(df2, clmlist, "inner") is my joining function. I want something like df1.join(df2, clmlist, "inner").drop(clmlist).
I never tried this. Are you getting any error while running it? Please do let me know.
I tried df1.join(df2, clmlist, "inner") and it works, but the drop doesn't: df1.join(df2, clmlist, "inner").drop(clmlist). So I want a method to drop all the columns in clmlist.
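
For what it's worth, the varargs expansion shown in the first answer should cover this case; a sketch, assuming clmlist is a List[String]:

val clmlist = List("column1", "column2")
df1.join(df2, clmlist, "inner").drop(clmlist: _*)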

I wasn't completely satisfied with the answers here. For the most part, especially @stack0114106's answers, they hint at the right way and at the complexity of doing it cleanly, but they seem incomplete. To me, a clean, automated way of doing this is to use df.columns to get the columns as a list of strings, and then use sets to find either the common columns to drop or the unique columns to keep, depending on your use case. However, if you use select, you will have to alias the DataFrames so Spark knows which of the non-unique columns to keep. Anyway, sketched in Scala:

val common_cols = df_b.columns.toSet.intersect(df_a.columns.toSet)

df_a.join(df_b.drop(common_cols.toSeq: _*))   // join condition omitted here

The select version looks similar, but you have to add in the aliasing.

import org.apache.spark.sql.functions.col

val unique_b_cols = df_b.columns.toSet.diff(df_a.columns.toSet).toList
val a_cols_aliased = df_a.columns.map(c => "a." + c).toList
val keep_columns = (a_cols_aliased ++ unique_b_cols).map(col)

df_a.alias("a")
    .join(df_b.alias("b"))   // join condition omitted here
    .select(keep_columns: _*)

I prefer the drop approach, but having written a bunch of Spark code, I find that a select statement often leads to cleaner code.
