I am getting many duplicated columns after joining two DataFrames, and I want to drop the columns that come last. Below is my printSchema output:

root
 |-- id: string (nullable = true)
 |-- value: string (nullable = true)
 |-- test: string (nullable = true)
 |-- details: string (nullable = true)
 |-- test: string (nullable = true)
 |-- value: string (nullable = true)

Now I want to drop the last two columns:

 |-- test: string (nullable = true)
 |-- value: string (nullable = true)

I tried df.dropDuplicates(), but it did not help (it deduplicates rows, not columns).

How do I drop the duplicated columns that come last?

4 Answers

You have to use varargs syntax to expand the column names from an array into drop. Check below:

scala> dfx.show
+---+---+---+---+------------+------+
|  A|  B|  C|  D|         arr|mincol|
+---+---+---+---+------------+------+
|  1|  2|  3|  4|[1, 2, 3, 4]|     A|
|  5|  4|  3|  1|[5, 4, 3, 1]|     D|
+---+---+---+---+------------+------+

scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)

scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)

scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
|  1|  2|  3|  4|
|  5|  4|  3|  1|
+---+---+---+---+


scala>

Update1:

scala>  val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
|  A|  B|  C|  D|  B|  C|
+---+---+---+---+---+---+
|  1|  2|  3|  4|  2|  3|
|  5|  4|  3|  1|  4|  3|
+---+---+---+---+---+---+


scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
|  1|  2|  3|  4|
|  5|  4|  3|  1|
+---+---+---+---+


scala>

Update2:

To remove the columns dynamically, check the solution below.

scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]

scala> df3.show
+---+---+---+---+---+---+
|  A|  B|  C|  D|  B|  C|
+---+---+---+---+---+---+
|  1|  2|  3|  4|  9|  9|
|  5|  4|  3|  1|  8|  8|
+---+---+---+---+---+---+

scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)

scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)

scala> import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrame

scala> val df4 = rem2.foldLeft(df3) { (acc: DataFrame, colName: String) => acc.drop(col(colName)) }
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala>  df4.show
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
|  1|  2|  3|  4|
|  5|  4|  3|  1|
+---+---+---+---+


scala>

Update3:

Renaming/aliasing in one go.

scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val dfa2 = dfa.columns.foldLeft(dfa) { (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")}
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]

scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
|  1|  2|  3|  4|
|  5|  4|  3|  1|
+---+---+---+---+


scala>

5 Comments

This solution drops both of the duplicated columns; my requirement is to drop one and keep the other.
In that case, you can alias the table names and chain drop() calls on the resulting df. Check my update.
Check my Update2 on how to approach it dynamically. Let me know if it helps.
Is it possible to alias a list of columns in one go?
Yes, with the foldLeft technique you can rename the columns in one go. Check my Update3.

  1. df.dropDuplicates() works only on rows.
  2. You can drop one side's copy of a duplicated column by reference: df1.drop(df2.col("value")) (see the sketch below).
  3. You can specify the columns you want to keep, for example with df.select(...).
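
A minimal sketch of points 2 and 3, assuming the question's two DataFrames share an "id" join key and both carry "value" and "test" columns (the data here is illustrative; in spark-shell, toDF comes from spark.implicits._):

import org.apache.spark.sql.functions.col

val df1 = Seq(("1", "v1", "t1", "d1")).toDF("id", "value", "test", "details")
val df2 = Seq(("1", "v2", "t2")).toDF("id", "value", "test")

val joined = df1.join(df2, Seq("id"))

// point 2: drop df2's copy of each duplicated column by reference
val viaDrop = joined.drop(df2.col("value")).drop(df2.col("test"))

// point 3: select only the columns to keep, disambiguating the
// duplicated names through the DataFrame they came from
val viaSelect = joined.select(col("id"), df1.col("value"), df1.col("test"), df1.col("details"))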

3 Comments

df.select gives an ambiguity error when two duplicated columns are present; my requirement is to drop one and keep the other of the duplicated pair.
Try df2.column("value") instead of new Column("value")
Getting an error: df2.drop(df1.column("id")) gives <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame
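
That error is expected: DataFrame has no column method; its column accessors are col and apply. A tiny sketch of the two forms that do compile:

val byCol   = df2.col("value")   // explicit accessor
val byApply = df2("value")       // the same Column via apply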

Suppose you have two DataFrames, DF1 and DF2. You can use either of these ways to join on particular columns:

 1. DF1.join(DF2,Seq("column1","column2"))
 2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))

So to drop the duplicate columns you can use

 1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1"),DF1("column2"))
 2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")).drop(DF1("column1"),DF1("column2"))

In either case you can use drop("columnname") to drop whatever columns you need; it doesn't matter which DataFrame they come from, since they are equal in this case.
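
A short runnable sketch of the difference between the two join styles (illustrative data; in spark-shell, toDF comes from spark.implicits._):

val DF1 = Seq((1, "a"), (2, "b")).toDF("column1", "x")
val DF2 = Seq((1, "c"), (2, "d")).toDF("column1", "y")

// style 1: the using-columns join keeps a single copy of column1
DF1.join(DF2, Seq("column1")).columns
// Array(column1, x, y)

// style 2: the expression join keeps both copies, so drop one side's
DF1.join(DF2, DF1("column1") === DF2("column1")).drop(DF2("column1")).columns
// Array(column1, x, y)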

5 Comments

The second approach worked, but I have the list of columns to be dropped in a val: List("column1", "column2", "columnn"). How do I pass this list to drop(DF1("column1"), DF1("column2"), ...... DF1("columnn"))?
What do you mean by a list of columns? Are they dynamically generated?
val clmlist = List("column1", "column2", "columnn"); df1.join(df2, clmlist, "inner") is my joining function. I want something like df1.join(df2, clmlist, "inner").drop(clmlist).
I never tried this. Are you getting any error while running it? Please do let me know.
I tried df1.join(df2, clmlist, "inner") and it works, but the drop doesn't: df1.join(df2, clmlist, "inner").drop(clmlist). So I want a method to drop all the columns in clmlist.
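
For what it's worth, the varargs expansion shown in the first answer should cover this case; a sketch, assuming clmlist is a List[String]:

val clmlist = List("column1", "column2")
df1.join(df2, clmlist, "inner").drop(clmlist: _*)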

I wasn't completely satisfied with the answers here. For the most part, especially @stack0114106's answers, they hint at the right way and at the complexity of doing it cleanly, but they seem incomplete. To me, a clean, automated way of doing this is to use df.columns to get the columns as a list of strings, and then use sets to find either the common columns to drop or the unique columns to keep, depending on your use case. However, if you use select, you will have to alias the DataFrames so Spark knows which of the non-unique columns to keep. Anyway, sketched in Scala:

val common_cols = df_b.columns.toSet.intersect(df_a.columns.toSet)

df_a.join(df_b.drop(common_cols.toSeq: _*))   // join condition omitted here

The select version looks similar, but you have to add in the aliasing.

import org.apache.spark.sql.functions.col

val unique_b_cols = df_b.columns.toSet.diff(df_a.columns.toSet).toList
val a_cols_aliased = df_a.columns.map(c => "a." + c).toList
val keep_columns = (a_cols_aliased ++ unique_b_cols).map(col)

df_a.alias("a")
    .join(df_b.alias("b"))   // join condition omitted here
    .select(keep_columns: _*)

I prefer the drop approach, but having written a bunch of Spark code, I find that a select statement often leads to cleaner code.
