spark dataframe drop duplicates and keep first

Question

In pandas, when dropping duplicates you can specify which occurrence to keep (keep='first' or keep='last'). Is there an equivalent for Spark DataFrames?

Pandas:

df.sort_values('actual_datetime', ascending=False).drop_duplicates(subset=['scheduled_datetime', 'flt_flightnumber'], keep='first')

The Spark DataFrame API (I use Spark 1.6.0) doesn't have the keep option:

df.orderBy(['actual_datetime']).dropDuplicates(subset=['scheduled_datetime', 'flt_flightnumber'])

Imagine scheduled_datetime and flt_flightnumber are columns 6 and 17. By creating keys based on the values of these columns we can also deduplicate:

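def get_key(x):
    # build a string key from columns 6 and 17
    return "{0}{1}".format(x[6], x[17])

# note: the result is an RDD of (key, row) pairs, not a DataFrame
df = df.map(lambda x: (get_key(x), x)).reduceByKey(lambda x, y: x)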

But how do I specify that the first row should be kept and the other duplicates removed? What about keeping the last row?

When you run dropDuplicates you are keeping the first row per the columns combination specified and getting rid of the rest of the dupes. Are you sure your code isn't already doing what you want?
– David Arenburg, Commented Jul 31, 2016 at 19:10
But then what to do if I want to keep the last row? I am looking only at one column for duplicate values.
– hipoglucido, Commented May 9, 2017 at 12:38
I think that if you want to keep the last row then you should sort descending first, and then drop duplicates.
– SwampDev, Commented Apr 11, 2018 at 13:26

timle · Accepted Answer · 2019-03-05 19:18:46Z

To everyone saying that dropDuplicates keeps the first occurrence - this is not strictly correct.

dropDuplicates keeps the 'first occurrence' of a sort operation only if there is a single partition. See below for some examples.
However, a single partition is not practical for most Spark datasets, so at the bottom of this post I'm also including an example of a 'first occurrence' drop-duplicates operation using a Window function + sort + rank + filter.

This is tested in Spark 2.4.0 using pyspark.

dropDuplicates examples

import pandas as pd

# generating some example data with pandas, will convert to spark df below
df1 = pd.DataFrame({'col1':range(0,5)})
df1['datestr'] = '2018-01-01'
df2 = pd.DataFrame({'col1':range(0,5)})
df2['datestr'] = '2018-02-01'
df3 = pd.DataFrame({'col1':range(0,5)})
df3['datestr'] = '2018-03-01'
dfall = pd.concat([df1,df2,df3])
print(dfall)

   col1     datestr
0     0  2018-01-01
1     1  2018-01-01
2     2  2018-01-01
3     3  2018-01-01
4     4  2018-01-01
0     0  2018-02-01
1     1  2018-02-01
2     2  2018-02-01
3     3  2018-02-01
4     4  2018-02-01
0     0  2018-03-01
1     1  2018-03-01
2     2  2018-03-01
3     3  2018-03-01
4     4  2018-03-01

# first example
# does not give first (based on datestr)
(spark.createDataFrame(dfall)
   .orderBy('datestr')
   .dropDuplicates(subset = ['col1'])
   .show()
)

# dropDuplicates NOT based on occurrence of sorted datestr

+----+----------+
|col1|   datestr|
+----+----------+
|   0|2018-03-01|
|   1|2018-02-01|
|   3|2018-02-01|
|   2|2018-02-01|
|   4|2018-01-01|
+----+----------+

# second example
# testing what happens with repartition
(spark.createDataFrame(dfall)
   .orderBy('datestr')
   .repartition('datestr')
   .dropDuplicates(subset = ['col1'])
   .show()
)

# dropDuplicates NOT based on occurrence of sorted datestr

+----+----------+
|col1|   datestr|
+----+----------+
|   0|2018-02-01|
|   1|2018-01-01|
|   3|2018-02-01|
|   2|2018-02-01|
|   4|2018-02-01|
+----+----------+

# third example
# testing with coalesce(1)
(spark
   .createDataFrame(dfall)
   .orderBy('datestr')
   .coalesce(1)
   .dropDuplicates(subset = ['col1'])
   .show()
)

# dropDuplicates based on occurrence of sorted datestr

+----+----------+
|col1|   datestr|
+----+----------+
|   0|2018-01-01|
|   1|2018-01-01|
|   2|2018-01-01|
|   3|2018-01-01|
|   4|2018-01-01|
+----+----------+

# fourth example
# testing with reverse sort then coalesce(1)
(spark
   .createDataFrame(dfall)
   .orderBy('datestr', ascending = False)
   .coalesce(1)
   .dropDuplicates(subset = ['col1'])
   .show()
)
# dropDuplicates based on occurrence of sorted datestr

+----+----------+
|col1|   datestr|
+----+----------+
|   0|2018-03-01|
|   1|2018-03-01|
|   2|2018-03-01|
|   3|2018-03-01|
|   4|2018-03-01|
+----+----------+

window, sort, rank, filter example

# generating some example data with pandas
df1 = pd.DataFrame({'col1':range(0,5)})
df1['datestr'] = '2018-01-01'
df2 = pd.DataFrame({'col1':range(0,5)})
df2['datestr'] = '2018-02-01'
df3 = pd.DataFrame({'col1':range(0,5)})
df3['datestr'] = '2018-03-01'
dfall = pd.concat([df1,df2,df3])
# into spark df
df_s = (spark.createDataFrame(dfall))

from pyspark.sql import Window
from pyspark.sql.functions import rank, col  # col is needed for the filter below
window = Window.partitionBy("col1").orderBy("datestr")
(df_s.withColumn('rank', rank().over(window))
.filter(col('rank') == 1)
.drop('rank')
.show()
)

+----+----------+
|col1|   datestr|
+----+----------+
|   0|2018-01-01|
|   1|2018-01-01|
|   3|2018-01-01|
|   2|2018-01-01|
|   4|2018-01-01|
+----+----------+

# however this fails if ties/duplicates exist in the windowing partitions
# and so a tie breaker for the 'rank' function must be added

# generating some example data with pandas, will convert to spark df below
df1 = pd.DataFrame({'col1':range(0,5)})
df1['datestr'] = '2018-01-01'
df2 = pd.DataFrame({'col1':range(0,5)})
df2['datestr'] = '2018-01-01' # note duplicates in this dataset
df3 = pd.DataFrame({'col1':range(0,5)})
df3['datestr'] = '2018-03-01'
dfall = pd.concat([df1,df2,df3])
print(dfall)

   col1     datestr
0     0  2018-01-01
1     1  2018-01-01
2     2  2018-01-01
3     3  2018-01-01
4     4  2018-01-01
0     0  2018-01-01
1     1  2018-01-01
2     2  2018-01-01
3     3  2018-01-01
4     4  2018-01-01
0     0  2018-03-01
1     1  2018-03-01
2     2  2018-03-01
3     3  2018-03-01
4     4  2018-03-01

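# rebuild the Spark DataFrame so that it holds the new data with ties
df_s = spark.createDataFrame(dfall)
# this will fail, since duplicates exist within the window partitions
# and no way to specify ranking style exists in pyspark rank() fn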
window = Window.partitionBy("col1").orderBy("datestr")
(df_s.withColumn('rank', rank().over(window))
.filter(col('rank') == 1)
.drop('rank')
.show()
)

+----+----------+
|col1|   datestr|
+----+----------+
|   0|2018-01-01|
|   0|2018-01-01|
|   1|2018-01-01|
|   1|2018-01-01|
|   3|2018-01-01|
|   3|2018-01-01|
|   2|2018-01-01|
|   2|2018-01-01|
|   4|2018-01-01|
|   4|2018-01-01|
+----+----------+

# to deal with ties within window partitions, a tiebreaker column is added
from pyspark.sql import Window
from pyspark.sql.functions import rank, col, monotonically_increasing_id
window = Window.partitionBy("col1").orderBy("datestr",'tiebreak')
(df_s
 .withColumn('tiebreak', monotonically_increasing_id())
 .withColumn('rank', rank().over(window))
 .filter(col('rank') == 1).drop('rank','tiebreak')
 .show()
)

+----+----------+
|col1|   datestr|
+----+----------+
|   0|2018-01-01|
|   1|2018-01-01|
|   3|2018-01-01|
|   2|2018-01-01|
|   4|2018-01-01|
+----+----------+

I tested with a real dataset (500k+ rows); .coalesce(1) is required.
None of the above gives the right answer. We need a better answer for this post.
This answer is almost optimal. Instead of rank(), which allows ties to share the same rank, use the row_number() function. This is simpler than using a tiebreaker column.
Good examples, but coalesce(1) pushes all data to the driver, and with a bigger volume you will run into memory problems.
Joey Lesh is right. Check the answer by Alex at the bottom, which combines the solution with Joey's input.

Tyler2P · Accepted Answer · 2022-02-12 11:33:33Z

25

Use window and row_number functions.
Order by ascending or descending to select first or last.

from pyspark.sql import Window
from pyspark.sql import functions as f

# the sort direction is set on the column, not on the window spec
window = Window.partitionBy("col1").orderBy(f.col("datestr").asc())
df = (df.withColumn('row', f.row_number().over(window))
      .filter(f.col('row') == 1)
      .drop('row'))
df.show()

edited Feb 12, 2022 at 11:33

Tyler2P

answered Feb 11, 2022 at 16:03

Alex

3 Comments

blackbishop Over a year ago

This answer already exists stackoverflow.com/a/58540058/1386551

flying_fluid_four Over a year ago

no need for asc(). Default behavior of orderBy is already ascending

timedacorn Over a year ago

Do partitionBy and row_number also result in pushing all data into the driver, similar to how coalesce(1) does, as mentioned in the top answer?

Carmen Pérez Carrillo · Accepted Answer · 2020-01-27 10:02:47Z

You can use a window with row_number:

import pandas as pd
df1 = pd.DataFrame({'col1':range(0,5)})
df1['datestr'] = '2018-01-01'
df2 = pd.DataFrame({'col1':range(0,5)})
df2['datestr'] = '2018-02-01'
df3 = pd.DataFrame({'col1':range(0,5)})
df3['datestr'] = '2018-03-01'
dfall = spark.createDataFrame(pd.concat([df1,df2,df3]))

from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col,row_number
window = Window.partitionBy('col1').orderBy(col('datestr'))
dfall.select('*', row_number().over(window).alias('posicion')).show()
dfall.select('*', row_number().over(window).alias('posicion')).where('posicion ==1').show()

+----+----------+--------+
|col1|   datestr|posicion|
+----+----------+--------+
|   0|2018-01-01|       1|
|   0|2018-02-01|       2|
|   0|2018-03-01|       3|
|   1|2018-01-01|       1|
|   1|2018-02-01|       2|
|   1|2018-03-01|       3|
|   3|2018-01-01|       1|
|   3|2018-02-01|       2|
|   3|2018-03-01|       3|
|   2|2018-01-01|       1|
|   2|2018-02-01|       2|
|   2|2018-03-01|       3|
|   4|2018-01-01|       1|
|   4|2018-02-01|       2|
|   4|2018-03-01|       3|
+----+----------+--------+
+----+----------+--------+
|col1|   datestr|posicion|
+----+----------+--------+
|   0|2018-01-01|       1|
|   1|2018-01-01|       1|
|   3|2018-01-01|       1|
|   2|2018-01-01|       1|
|   4|2018-01-01|       1|
+----+----------+--------+

Mahmoud Hanafy · Accepted Answer · 2020-02-11 16:33:17Z

5

I did the following:

dataframe.groupBy("uniqueColumn").min("time")

This groups by the given column and, within each group, picks the row with the min time (this keeps the first and removes the others).

answered Feb 11, 2020 at 16:33

Mahmoud Hanafy

3 Comments

KevinG Over a year ago

Care to elaborate @mahmoud-hanafy? Using groupBy and min/max is clear, but how does this solve the duplicates problem? How can we ensure that rows with the min time are always kept?

Mahmoud Hanafy Over a year ago

Hi KevinG, you need to understand how groupBy works. If you have a List((0, 1), (0, 2), (0, 3), (1, 4)) then you do groupBy(_._1) you will have => Map(0 -> List(1, 2, 3), 1 -> List(4)). Then when choosing the min one of the values you will have => Map(0 -> 1, 1 -> 4). So duplicate values will be removed and you will keep the min one only

malumno Over a year ago

You will need a subsequent join to keep any additional columns, though.

Veera Marni · Accepted Answer · 2019-12-11 22:00:40Z

4

Solution 1: add a new row-number column (an incremental column), then drop duplicates by keeping the min row number after grouping on all the columns you are interested in. (You can include all the columns for dropping duplicates except the row-number column.)

Solution 2: turn the DataFrame into an RDD (df.rdd), group the RDD on one, several, or all keys, then run a lambda function on each group that drops rows the way you want and returns only the row you are interested in.

Old solution (a friend of mine, Sameer, mentioned that it didn't work for him): use the dropDuplicates method; by default it keeps the first occurrence.

edited Dec 11, 2019 at 22:00

answered Jun 30, 2017 at 16:21

Veera Marni

5 Comments

Manrique Over a year ago

Can you please provide a valid source?

Gunay Anach Over a year ago

@Manrique - Docs are here -> spark.apache.org/docs/2.1.0/api/python/…

Manrique Over a year ago

Thanks @GunayAnach, but I can't see in that doc that by default dropDuplicates keeps the first occurrence.

Gunay Anach Over a year ago

It seems to be based on pandas, where -> keep : {‘first’, ‘last’, False}, default ‘first’ pandas.pydata.org/pandas-docs/stable/reference/api/… But you are right, not specified on the docs. They seem rather dry, compared to pandas.

MasterLuV Over a year ago

How can I keep the last element in Scala?

Gunay Anach · Accepted Answer · 2019-06-26 10:37:03Z

I just did something perhaps similar to what you guys need, using drop_duplicates in PySpark.

The situation is this: I have 2 dataframes (coming from 2 files) which are exactly the same except for 2 columns, file_date (the file date extracted from the file name) and data_date (the row date stamp). Annoyingly, I have rows with the same data_date (and all other column cells too) but a different file_date, as they get replicated in every newly arriving file with the addition of one new row.

I needed to capture all rows from the new file, plus the one row left over from the previous file. That row is not in the new file. The remaining columns to the right of data_date are the same between the two files for the same data_date.

file_1_20190122 - df1

+------------+----------+----------+
|station_code| file_date| data_date|
+------------+----------+----------+
|        AGGH|2019-01-22|2019-01-16| <- One row we want to keep where file_date 22nd
|        AGGH|2019-01-22|2019-01-17|
|        AGGH|2019-01-22|2019-01-18|
|        AGGH|2019-01-22|2019-01-19|
|        AGGH|2019-01-22|2019-01-20|
|        AGGH|2019-01-22|2019-01-21|
|        AGGH|2019-01-22|2019-01-22|


file_2_20190123 - df2

+------------+----------+----------+
|station_code| file_date| data_date|
+------------+----------+----------+
|        AGGH|2019-01-23|2019-01-17| \/ ALL rows we want to keep where file_date 23rd
|        AGGH|2019-01-23|2019-01-18|
|        AGGH|2019-01-23|2019-01-19|
|        AGGH|2019-01-23|2019-01-20|
|        AGGH|2019-01-23|2019-01-21|
|        AGGH|2019-01-23|2019-01-22|
|        AGGH|2019-01-23|2019-01-23|

This will require us to sort and concat df's, then deduplicate them on all columns but one. Let me walk you through.

union_df = df1.union(df2) \
                .sort(['station_code', 'data_date'], ascending=[True, True])

+------------+----------+----------+
|station_code| file_date| data_date|
+------------+----------+----------+
|        AGGH|2019-01-22|2019-01-16| <- keep
|        AGGH|2019-01-23|2019-01-17| <- keep
|        AGGH|2019-01-22|2019-01-17| x- drop
|        AGGH|2019-01-22|2019-01-18| x- drop
|        AGGH|2019-01-23|2019-01-18| <- keep
|        AGGH|2019-01-23|2019-01-19| <- keep
|        AGGH|2019-01-22|2019-01-19| x- drop
|        AGGH|2019-01-23|2019-01-20| <- keep
|        AGGH|2019-01-22|2019-01-20| x- drop
|        AGGH|2019-01-22|2019-01-21| x- drop
|        AGGH|2019-01-23|2019-01-21| <- keep
|        AGGH|2019-01-23|2019-01-22| <- keep
|        AGGH|2019-01-22|2019-01-22| x- drop
|        AGGH|2019-01-23|2019-01-23| <- keep

Here we drop the already sorted duplicate rows, comparing on every column except 'file_date'.

nonduped_union_df = union_df \
            .drop_duplicates(['station_code', 'data_date', 'time_zone', 
                              'latitude', 'longitude', 'elevation', 
                              'highest_temperature', 'lowest_temperature', 
                              'highest_temperature_10_year_normal', 
                              'another_50_columns'])

And the result holds ONE row with earliest date from DF1 which is not in DF2 and ALL rows from DF2

nonduped_union_df.select(['station_code', 'file_date', 'data_date', 
                          'highest_temperature', 'lowest_temperature']) \
                         .sort(['station_code', 'data_date'], ascending=[True, True]) \
                         .show(30)


+------------+----------+----------+-------------------+------------------+
|station_code| file_date| data_date|highest_temperature|lowest_temperature|
+------------+----------+----------+-------------------+------------------+
|        AGGH|2019-01-22|2019-01-16|                 90|                77| <- df1 22nd
|        AGGH|2019-01-23|2019-01-17|                 90|                77| \/- df2 23rd
|        AGGH|2019-01-23|2019-01-18|                 91|                75|
|        AGGH|2019-01-23|2019-01-19|                 88|                77|
|        AGGH|2019-01-23|2019-01-20|                 88|                77|
|        AGGH|2019-01-23|2019-01-21|                 88|                77|
|        AGGH|2019-01-23|2019-01-22|                 90|                75|
|        AGGH|2019-01-23|2019-01-23|                 90|                75|
|        CWCA|2019-01-22|2019-01-15|                 23|                -2|
|        CWCA|2019-01-23|2019-01-16|                  7|                -8|
|        CWCA|2019-01-23|2019-01-17|                 28|                -6|
|        CWCA|2019-01-23|2019-01-18|                  0|               -13|
|        CWCA|2019-01-23|2019-01-19|                 25|               -15|
|        CWCA|2019-01-23|2019-01-20|                 -4|               -18|
|        CWCA|2019-01-23|2019-01-21|                 27|                -6|
|        CWCA|2019-01-22|2019-01-22|                 30|                17|
|        CWCA|2019-01-23|2019-01-22|                 30|                13|
|        CWCO|2019-01-22|2019-01-15|                 34|                29|
|        CWCO|2019-01-23|2019-01-16|                 33|                13|
|        CWCO|2019-01-22|2019-01-16|                 33|                13|
|        CWCO|2019-01-22|2019-01-17|                 23|                 7|
|        CWCO|2019-01-23|2019-01-17|                 23|                 7|
+------------+----------+----------+-------------------+------------------+
only showing top 30 rows

It may not be the most suitable answer for this case, but it's the one that worked for me.

Let me know if you get stuck somewhere.

BTW - if anyone can tell me how to select all columns in a df except one, without listing them in a list, I will be very thankful.

Regards G

How to select all columns in a df except one:

all_columns_but_one = [c for c in df.columns if c != "undesired_column_name"]
df = df.select(all_columns_but_one)

Into Numbers · Accepted Answer · 2021-04-28 20:28:11Z

I would try this way:

Assuming your data_df looks like this, and we want to keep the rows with the highest value in col1 per datestr:

  col1     datestr
     0  2018-01-01
     1  2018-01-01
     2  2018-01-01
     3  2018-01-01
     4  2018-01-01
     0  2018-02-01
     1  2018-02-01
     2  2018-02-01
     3  2018-02-01
     4  2018-02-01
     0  2018-03-01
     1  2018-03-01
     2  2018-03-01
     3  2018-03-01
     4  2018-03-01

you can do:

from pyspark.sql import Window 
import pyspark.sql.functions as F

w = Window.partitionBy('datestr')
data_df = data_df.withColumn("max", F.max(F.col("col1"))\
    .over(w))\
    .where(F.col('max') == F.col('col1'))\
    .drop("max")

this results in:

  col1     datestr
     4  2018-01-01
     4  2018-02-01
     4  2018-03-01

Vinayak Sangar · Accepted Answer · 2021-03-10 08:03:49Z

-1

Given the below table:

+----+----------+
|col1|   datestr|
+----+----------+
|   0|2018-01-01|
|   1|2018-01-01|
|   2|2018-01-01|
|   3|2018-01-01|
|   4|2018-01-01|
|   0|2018-02-01|
|   1|2018-02-01|
|   2|2018-02-01|
|   3|2018-02-01|
|   4|2018-02-01|
|   0|2018-03-01|
|   1|2018-03-01|
|   2|2018-03-01|
|   3|2018-03-01|
|   4|2018-03-01|
+----+----------+

You can do it in two steps:

Group the given table by col1 and pick the min date (min_datestr) for each group.

+----+----------+
|col1|   datestr|
+----+----------+
|   0|2018-01-01|
|   1|2018-01-01|
|   2|2018-01-01|
|   3|2018-01-01|
|   4|2018-01-01|
+----+----------+

Left join the resultant table with the original table on col1 and min_datestr to bring back the remaining columns.

answered Mar 10, 2021 at 8:03

Vinayak Sangar

Eniola Olaleye · Accepted Answer · 2021-10-29 16:22:10Z

-1

If the dataset isn't large, convert it to a pandas DataFrame, drop duplicates keeping the last or first row, then convert it back.

answered Oct 29, 2021 at 16:22

Eniola Olaleye

1 Comment

parlad Over a year ago

Hi there, thanks for sharing your thoughts. The question is not asking whether to use a pandas DF or a Spark DF. Changing technologies isn't simple. Please read the question in detail.
