PySpark DataFrame: How to remove duplicate rows in a DataFrame in Databricks

Question

I have an existing DataFrame in Databricks that contains many rows that are exactly the same in every column. For example:

df:

No.	Name	Age	Country
1	John	20	US
1	John	20	US
2	Cici	25	Japan
3	Tom	36	Canada
3	Tom	36	Canada
3	Tom	36	Canada

I want to end up with the following:

No.	Name	Age	Country
1	John	20	US
2	Cici	25	Japan
3	Tom	36	Canada

How do I write the script for this? Thank you.

Does this answer your question? Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame
– samkart, Commented Jul 20, 2023 at 10:03

notNull · Accepted Answer · 2023-07-20 02:56:54Z

Use either distinct() or dropDuplicates() on the DataFrame. Called without arguments, both remove rows that are identical in every column.

Example:

df.distinct().show()

(or)

df.dropDuplicates().show()

Sample code:

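# Build a small DataFrame with repeated rows (in a Databricks notebook, `spark` is predefined)
df = spark.createDataFrame(
    [(1, 'John', 20, 'US'), (1, 'John', 20, 'US'), (1, 'John', 20, 'US'), (2, 'CICI', 25, 'Japan')],
    ['No.', 'Name', 'Age', 'country'],
)
df.distinct().show()        # keeps one copy of each fully identical row
df.dropDuplicates().show()  # same result when no column list is given
# output (row order may vary)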
#+---+----+---+-------+
#|No.|Name|Age|country|
#+---+----+---+-------+
#|  1|John| 20|     US|
#|  2|CICI| 25|  Japan|
#+---+----+---+-------+
#
#+---+----+---+-------+
#|No.|Name|Age|country|
#+---+----+---+-------+
#|  1|John| 20|     US|
#|  2|CICI| 25|  Japan|
#+---+----+---+-------+

