0
val someDF = Seq(
  (4623874, "user1", "success"),
  (4623874, "user2","fail"),
  (4623874, "user3","success"),
  (1343244, "user4","fail"),
  (4235252, "user5", "fail")
).toDF("primaryid", "user","status")


This is the input data frame is it possible to get the count status for each primary id other than groupby

someDF.groupBy("primaryid", "status").count.show



+-------+-------+-----+
primaryid| status|count|
+-------+-------+-----+
|4235252|   fail|    1|
|1343244|   fail|    1|
|4623874|   fail|    1|
|4623874|success|    2|
+-------+-------+-----+


Any other way to get the above result other than "groupby" ?

1 Answer 1

1

Use count window function. Check below code.

scala> val someDF = Seq(
     |   (4623874, "user1", "success"),
     |   (4623874, "user2","fail"),
     |   (4623874, "user3","success"),
     |   (1343244, "user4","fail"),
     |   (4235252, "user5", "fail")
     | ).toDF("primaryid", "user","status")

scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._

someDF
.withColumn("count",
    count($"status")
    .over(
        Window
        .partitionBy($"primaryid",$"status")
        .orderBy($"primaryid".asc)
    )
).show(false)
+---------+-----+-------+-----+
|primaryid|user |status |count|
+---------+-----+-------+-----+
|4235252  |user5|fail   |1    |
|1343244  |user4|fail   |1    |
|4623874  |user2|fail   |1    |
|4623874  |user1|success|2    |
|4623874  |user3|success|2    |
+---------+-----+-------+-----+
scala> :paste
// Entering paste mode (ctrl-D to finish)

someDF
.withColumn("count",
    count($"status")
    .over(
        Window
        .partitionBy($"primaryid",$"status")
        .orderBy($"primaryid".asc)
    )
)
.filter($"status" === "success")
.show(false)

// Exiting paste mode, now interpreting.

+---------+-----+-------+-----+
|primaryid|user |status |count|
+---------+-----+-------+-----+
|4623874  |user1|success|2    |
|4623874  |user3|success|2    |
+---------+-----+-------+-----+
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks for your response one more clarification, Is it possible to get only success count for specific primaryid . I mean like this someDF.groupBy("primaryid").agg(count(col("status") === "success").show()

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.