
I have a DataFrame df with columns

date: timestamp
status: String
name: String

I'm trying to find the last status for each name:

  val users = df.select("name").distinct
  val final_status = users.map( t =>
  {
     val _name = t.getString(0)
     val record = df.where(col("name") === _name)
     val lastRecord = record.sort(desc("date")).first
     lastRecord
   })

This works with a local array, but with a Spark DataFrame it throws a java.lang.NullPointerException.
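For context: the NPE comes from referencing df inside a transformation on users, so the nested DataFrame operations are attempted on executors, where the driver-side state behind df is not available. One driver-only alternative is a window function; a minimal sketch under the column names above (row_number and Window are standard Spark SQL APIs, and the variable names are mine):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Rank each name's records by date, newest first.
val w = Window.partitionBy("name").orderBy(desc("date"))

// Keep only the newest record per name, then drop the helper column.
val lastStatus = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")
```

Unlike a sort followed by a global dedup, the ordering here is applied per partition key, so the "first row wins" semantics are well defined.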

Update 1: Using removeDuplicates

df.sort(desc("date")).removeDuplicates("name")

Is this a good solution?

  • a) This has been covered multiple times on SO and it cannot work b) what is the source of removeDuplicates? Doesn't look like an existing method. Commented Apr 14, 2016 at 9:21

1 Answer


This

df.sort(desc("date")).removeDuplicates("name")

is not guaranteed to work, because the ordering from sort is not preserved through the dedup. The solutions in response to this question should work for you:

spark: How to do a dropDuplicates on a dataframe while keeping the highest timestamped row
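The approaches in that question generally boil down to aggregating the latest date per name and joining back to recover the status. A hedged sketch of that pattern (variable names are mine; if two records for a name share the same latest date, both survive the join):

```scala
import org.apache.spark.sql.functions.max

// Latest date per name.
val latest = df.groupBy("name").agg(max("date").as("date"))

// Join back on (name, date) to recover the status of the newest record.
val lastStatus = df.join(latest, Seq("name", "date"))
```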
