106

Let's say I have a rather large dataset in the following form:

data = sc.parallelize([('Foo', 41, 'US', 3),
                       ('Foo', 39, 'UK', 1),
                       ('Bar', 57, 'CA', 2),
                       ('Bar', 72, 'CA', 2),
                       ('Baz', 22, 'US', 6),
                       ('Baz', 36, 'US', 6)])

I would like to remove duplicate rows based on the values of the first, third and fourth columns only.

Removing entirely duplicate rows is straightforward:

data = data.distinct()

and either row 5 or row 6 will be removed.

But how do I remove duplicate rows based only on columns 1, 3 and 4? I.e. remove either one of these:

('Baz', 22, 'US', 6)
('Baz', 36, 'US', 6)

In Python, this could be done by specifying columns with .drop_duplicates(). How can I achieve the same in Spark/PySpark?

1
  • What is the sample code in? Scala? Python? Commented Nov 9, 2023 at 20:34

8 Answers

149

PySpark does include a dropDuplicates() method, which was introduced in 1.4.

>>> from pyspark.sql import Row
>>> df = sc.parallelize([ \
...     Row(name='Alice', age=5, height=80), \
...     Row(name='Alice', age=5, height=80), \
...     Row(name='Alice', age=10, height=80)]).toDF()
>>> df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
| 10|    80|Alice|
+---+------+-----+

>>> df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
+---+------+-----+

4 Comments

Is there a way to capture the records that it did drop?
x = usersDf.drop_duplicates(subset=['DETUserId']) - X dataframe will be all the dropped records
@Rodney That is not what the documentation says: "Return a new DataFrame with duplicate rows removed, optionally only considering certain columns." spark.apache.org/docs/2.1.0/api/python/…
The result is non-deterministic, you most probably don't want to use that in production...
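On the non-determinism point: dropDuplicates() keeps an arbitrary row per key, so which duplicate survives can vary between runs. A pure-Python sketch (not the Spark API) of making the choice explicit, by deciding which row wins per key instead of keeping whichever comes first:

```python
# Pure-Python sketch (not Spark): deterministic dedup by choosing
# which row wins per (name, height) key -- here, the smallest age.
rows = [("Alice", 5, 80), ("Alice", 10, 80), ("Alice", 5, 80)]

def dedup_keep_min_age(rows):
    """Keep, per (name, height), the row with the smallest age."""
    best = {}
    for name, age, height in rows:
        key = (name, height)
        if key not in best or age < best[key][1]:
            best[key] = (name, age, height)
    return sorted(best.values())

print(dedup_keep_min_age(rows))  # [('Alice', 5, 80)]
```

In Spark itself the analogous deterministic approach is typically a window function (row_number() over a partition ordered by the tie-breaking column, then filtering to row number 1) rather than plain dropDuplicates().
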
28

From your question, it is unclear as to which columns you want to use to determine duplicates. The general idea behind the solution is to create a key based on the values of the columns that identify duplicates. Then you can use the reduceByKey or reduce operations to eliminate duplicates.

Here is some code to get you started:

def get_key(x):
    # Use a separator so e.g. ('a', 'bc') and ('ab', 'c') can't collide
    return "{0}|{1}|{2}".format(x[0], x[2], x[3])

m = data.map(lambda x: (get_key(x), x))

Now you have a key-value RDD that is keyed by columns 1, 3 and 4. The next step would be either a reduceByKey or a groupByKey and filter. This would eliminate duplicates.

r = m.reduceByKey(lambda x, y: x)  # keep one row per key
rows = r.values()                  # strip the keys to get plain rows back
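
To see what the key-then-reduce step is doing, here is a pure-Python sketch (plain dicts, not the Spark API) of the same keep-one-row-per-key logic:

```python
# Pure-Python sketch of the key-then-reduce idea (not Spark API).
data = [('Foo', 41, 'US', 3),
        ('Foo', 39, 'UK', 1),
        ('Bar', 57, 'CA', 2),
        ('Bar', 72, 'CA', 2),
        ('Baz', 22, 'US', 6),
        ('Baz', 36, 'US', 6)]

def get_key(x):
    # Separator avoids collisions like ('a', 'bc') vs ('ab', 'c');
    # a tuple key (x[0], x[2], x[3]) would work just as well.
    return "{0}|{1}|{2}".format(x[0], x[2], x[3])

seen = {}
for row in data:
    seen.setdefault(get_key(row), row)  # keep the first row per key

print(list(seen.values()))
```

This keeps the first occurrence per key, so ('Baz', 22, 'US', 6) survives and ('Baz', 36, 'US', 6) is dropped.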

Comments

19

I know you already accepted the other answer, but if you want to do this as a DataFrame, just use groupBy and agg. Assuming you had a DF already created (with columns named "col1", "col2", etc) you could do:

myDF.groupBy($"col1", $"col3", $"col4").agg($"col1", max($"col2"), $"col3", $"col4")

Note that in this case I chose the max of col2, but you could use avg, min, etc.
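
In plain Python terms (a sketch of the semantics, not the Spark API), groupBy on columns 1, 3 and 4 with a max aggregate over column 2 amounts to:

```python
# Pure-Python sketch of groupBy("col1","col3","col4").agg(max("col2"))
# (not Spark API), using the OP's sample data.
data = [('Foo', 41, 'US', 3),
        ('Foo', 39, 'UK', 1),
        ('Bar', 57, 'CA', 2),
        ('Bar', 72, 'CA', 2),
        ('Baz', 22, 'US', 6),
        ('Baz', 36, 'US', 6)]

groups = {}
for c1, c2, c3, c4 in data:
    key = (c1, c3, c4)
    groups[key] = max(groups.get(key, c2), c2)  # aggregate col2 with max

result = [(c1, c2, c3, c4) for (c1, c3, c4), c2 in groups.items()]
print(sorted(result))
```

Unlike dropDuplicates, this does not keep an intact original row: the surviving col2 value is the aggregate, which may come from a different row than the other columns.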

2 Comments

So far, my experience with DataFrames is that they make everything more elegant and a lot faster.
It should be noted that this answer is written in Scala - for pyspark replace $"col1" with col("col1") etc.
14

Agree with David. To add on: it may not be the case that we want to group by all columns other than the column(s) in the aggregate function, i.e. we may want to remove duplicates purely based on a subset of columns while retaining all columns of the original dataframe. So the better way to do this is to use the dropDuplicates DataFrame API, available since Spark 1.4.0.

For reference, see: https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrame

1 Comment

Do we have corresponding function in SparkR?
10

I used the built-in function dropDuplicates(). The Scala code is given below:

val data = sc.parallelize(List(("Foo", 41, "US", 3),
                               ("Foo", 39, "UK", 1),
                               ("Bar", 57, "CA", 2),
                               ("Bar", 72, "CA", 2),
                               ("Baz", 22, "US", 6),
                               ("Baz", 36, "US", 6))).toDF("x", "y", "z", "count")

data.dropDuplicates(Array("x","count")).show()

Output:

+---+---+---+-----+
|  x|  y|  z|count|
+---+---+---+-----+
|Baz| 22| US|    6|
|Foo| 39| UK|    1|
|Foo| 41| US|    3|
|Bar| 57| CA|    2|
+---+---+---+-----+

1 Comment

The question specifically asks for pyspark implementation, not scala
5

The program below shows how to drop duplicates across the whole row, or based on certain columns only:

import org.apache.spark.sql.SparkSession

object DropDuplicates {

    def main(args: Array[String]) {
        val spark =
            SparkSession.builder()
                .appName("DataFrame-DropDuplicates")
                .master("local[4]")
                .getOrCreate()

        import spark.implicits._

        // Create an RDD of tuples with some data
        val custs = Seq(
            (1, "Widget Co", 120000.00, 0.00, "AZ"),
            (2, "Acme Widgets", 410500.00, 500.00, "CA"),
            (3, "Widgetry", 410500.00, 200.00, "CA"),
            (4, "Widgets R Us", 410500.00, 0.0, "CA"),
            (3, "Widgetry", 410500.00, 200.00, "CA"),
            (5, "Ye Olde Widgete", 500.00, 0.0, "MA"),
            (6, "Widget Co", 12000.00, 10.00, "AZ")
        )
        val customerRows = spark.sparkContext.parallelize(custs, 4)

        // Convert RDD of tuples to DataFrame by supplying column names
        val customerDF = customerRows.toDF("id", "name", "sales", "discount", "state")

        println("*** Here's the whole DataFrame with duplicates")

        customerDF.printSchema()

        customerDF.show()

        // Drop fully identical rows
        val withoutDuplicates = customerDF.dropDuplicates()

        println("*** Now without duplicates")

        withoutDuplicates.show()

        val withoutPartials = customerDF.dropDuplicates(Seq("name", "state"))

        println("*** Now without partial duplicates too")

        withoutPartials.show()
    }

}

2 Comments

The comment "// drop fully identical rows" is correct the first time, and incorrect the second time. Perhaps a copy/paste error?
Thanks @JoshuaStafford , removed the bad comment.
0

All the approaches in the previous answers are good, and I feel dropDuplicates is the best approach.

Below is another way (groupBy + agg) to drop duplicates without using dropDuplicates, but if you look at the time/performance, dropDuplicates on a subset of columns is the champion (time taken: 1563 ms).

Below is the full listing and times:

import org.apache.spark.sql.SparkSession

object DropDups {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadFromUrl")
      .master("local[*]")
      .getOrCreate()

    val sc = spark.sparkContext
    import spark.implicits._
    spark.sparkContext.setLogLevel("Error")
    val data = sc.parallelize(List(
      ("Foo", 41, "US", 3),
      ("Foo", 39, "UK", 1),
      ("Bar", 57, "CA", 2),
      ("Bar", 72, "CA", 2),
      ("Baz", 22, "US", 6),
      ("Baz", 36, "US", 6)
    )).toDF("x", "y", "z", "count")

    spark.time {
      import org.apache.spark.sql.functions.first
      val data = sc.parallelize(List(
        ("Foo", 41, "US", 3),
        ("Foo", 39, "UK", 1),
        ("Bar", 57, "CA", 2),
        ("Bar", 72, "CA", 2),
        ("Baz", 22, "US", 6),
        ("Baz", 36, "US", 6)
      )).toDF("x", "y", "z", "count")

      val deduped = data
        .groupBy("x", "count")
        .agg(
          first("y").as("y"),
          first("z").as("z")
        )
      deduped.show()
    }
    spark.time {
      data.dropDuplicates(Array("x","count")).show()
    }
    spark.stop()
  }
}

Result:

+---+-----+---+---+
|  x|count|  y|  z|
+---+-----+---+---+
|Baz|    6| 22| US|
|Foo|    1| 39| UK|
|Bar|    2| 57| CA|
|Foo|    3| 41| US|
+---+-----+---+---+

Time taken: 7086 ms
+---+---+---+-----+
|  x|  y|  z|count|
+---+---+---+-----+
|Baz| 22| US|    6|
|Foo| 39| UK|    1|
|Bar| 57| CA|    2|
|Foo| 41| US|    3|
+---+---+---+-----+

Time taken: 1563 ms

Comments

-4

My DataFrame below contains the value 4 twice, so dropDuplicates will remove the repeated value:

scala> df.show
+-----+
|value|
+-----+
|    1|
|    4|
|    3|
|    5|
|    4|
|   18|
+-----+

scala> val newdf = df.dropDuplicates

scala> newdf.show
+-----+
|value|
+-----+
|    1|
|    3|
|    5|
|    4|
|   18|
+-----+

5 Comments

You can check in spark-shell; I have shared the correct output. This answer is related to how we can remove repeated values in a column or DataFrame.
Can you provide an example based on OPs question?
I have given an example in my answer itself; you can refer to that one.
Your post adds no value to this discussion. @vaerek has already posted a PySpark df.dropDuplicates() example including how it can be applied to more than one column (my initial question).
The sentence is incomprehensible. Please add the missing punctuation, missing words, etc. Thanks in advance.
