I have an existing DataFrame in Databricks that contains many rows that are exactly the same in all column values. For example:

df:

No. Name Age Country
1 John 20 US
1 John 20 US
2 Cici 25 Japan
3 Tom 36 Canada
3 Tom 36 Canada
3 Tom 36 Canada

I want to end up with the following:

No. Name Age Country
1 John 20 US
2 Cici 25 Japan
3 Tom 36 Canada

How do I write a script to do this? Thank you.

1 Answer

Use either distinct() or dropDuplicates() on the DataFrame. Called without arguments, both remove rows that are identical in every column.

Example:

df.distinct().show()

or

df.dropDuplicates().show()

Sample code:

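# Sample data with duplicate rows (in a Databricks notebook, `spark` is already defined)
df = spark.createDataFrame(
    [(1, 'John', 20, 'US'), (1, 'John', 20, 'US'), (1, 'John', 20, 'US'), (2, 'CICI', 25, 'Japan')],
    ['No.', 'Name', 'Age', 'country'],
)
df.distinct().show()        # drops rows that are identical in every column
df.dropDuplicates().show()  # same result when called without arguments
# Output (row order may vary):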
#+---+----+---+-------+
#|No.|Name|Age|country|
#+---+----+---+-------+
#|  1|John| 20|     US|
#|  2|CICI| 25|  Japan|
#+---+----+---+-------+
#
#+---+----+---+-------+
#|No.|Name|Age|country|
#+---+----+---+-------+
#|  1|John| 20|     US|
#|  2|CICI| 25|  Japan|
#+---+----+---+-------+