
I have a DataFrame df with columns

date: timestamp
status: String
name: String

I'm trying to find the last status for each name:

  val users = df.select("name").distinct
  val final_status = users.map( t =>
  {
     val _name = t.getString(0)
     val record = df.where(col("name") === _name)
     val lastRecord = record.sort(desc("date")).first
     lastRecord
   })

This works with a local array, but with a Spark DataFrame it throws a java.lang.NullPointerException.
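For context: the NPE comes from referencing df inside a transformation on users, so the nested DataFrame operations are attempted on executors, where the driver-side state behind df is not available. One driver-only alternative is a window function; a minimal sketch under the column names above (row_number and Window are standard Spark SQL APIs, and the variable names are mine):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Rank each name's records by date, newest first.
val w = Window.partitionBy("name").orderBy(desc("date"))

// Keep only the newest record per name, then drop the helper column.
val lastStatus = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")
```

Unlike a sort followed by a global dedup, the ordering here is applied per partition key, so the "first row wins" semantics are well defined.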

Update 1: Using removeDuplicates

df.sort(desc("date")).removeDuplicates("name")

Is this a good solution?

  • a) This has been covered multiple times on SO and it cannot work b) what is the source of removeDuplicates? Doesn't look like an existing method. Commented Apr 14, 2016 at 9:21

1 Answer


This

df.sort(desc("date")).removeDuplicates("name")

is not guaranteed to work, because the ordering from sort is not preserved through the dedup. The solutions in response to this question should work for you:

spark: How to do a dropDuplicates on a dataframe while keeping the highest timestamped row
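The approaches in that question generally boil down to aggregating the latest date per name and joining back to recover the status. A hedged sketch of that pattern (variable names are mine; if two records for a name share the same latest date, both survive the join):

```scala
import org.apache.spark.sql.functions.max

// Latest date per name.
val latest = df.groupBy("name").agg(max("date").as("date"))

// Join back on (name, date) to recover the status of the newest record.
val lastStatus = df.join(latest, Seq("name", "date"))
```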
