Spark DataFrame: count the elements in the columns

Question

val someDF = Seq(
  (4623874, "user1", "success"),
  (4623874, "user2","fail"),
  (4623874, "user3","success"),
  (1343244, "user4","fail"),
  (4235252, "user5", "fail")
).toDF("primaryid", "user","status")

This is the input DataFrame. Is it possible to get the count of each status for each primaryid without using groupBy? This is what I have now:

someDF.groupBy("primaryid", "status").count.show



+---------+-------+-----+
|primaryid| status|count|
+---------+-------+-----+
|  4235252|   fail|    1|
|  1343244|   fail|    1|
|  4623874|   fail|    1|
|  4623874|success|    2|
+---------+-------+-----+

Is there any way to get the above result without using groupBy?

s.polam · Accepted Answer · 2020-10-28 08:35:17Z


Use the count window function. Unlike groupBy, it keeps every input row and attaches the count to each one. See the code below.

scala> val someDF = Seq(
     |   (4623874, "user1", "success"),
     |   (4623874, "user2","fail"),
     |   (4623874, "user3","success"),
     |   (1343244, "user4","fail"),
     |   (4235252, "user5", "fail")
     | ).toDF("primaryid", "user","status")

scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._

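someDF
.withColumn("count",
    // Count rows per (primaryid, status) pair without collapsing them
    count($"status")
    .over(
        Window
        .partitionBy($"primaryid",$"status")
        // orderBy is optional here: all rows in a partition share the same primaryid
        .orderBy($"primaryid".asc)
    )
).show(false)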
+---------+-----+-------+-----+
|primaryid|user |status |count|
+---------+-----+-------+-----+
|4235252  |user5|fail   |1    |
|1343244  |user4|fail   |1    |
|4623874  |user2|fail   |1    |
|4623874  |user1|success|2    |
|4623874  |user3|success|2    |
+---------+-----+-------+-----+

scala> :paste
// Entering paste mode (ctrl-D to finish)

someDF
.withColumn("count",
    count($"status")
    .over(
        Window
        .partitionBy($"primaryid",$"status")
        .orderBy($"primaryid".asc)
    )
)
.filter($"status" === "success")
.show(false)

// Exiting paste mode, now interpreting.

+---------+-----+-------+-----+
|primaryid|user |status |count|
+---------+-----+-------+-----+
|4623874  |user1|success|2    |
|4623874  |user3|success|2    |
+---------+-----+-------+-----+
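
If only the success count per primaryid is needed, one option is a conditional count: wrap the status test in when(...) and count the result, either over a window or in a plain aggregation. The following is a minimal sketch rather than part of the original answer. The object name SuccessCountSketch, the helper names, the success_count column and the local SparkSession are assumptions made for illustration. In spark-shell the session already exists as spark.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, when}

object SuccessCountSketch {
  // when(...) without otherwise(...) yields null for non-matching rows,
  // and count(...) skips nulls, so only "success" rows are counted.
  val successCount: Column = count(when(col("status") === "success", 1))

  // Window variant: keeps every input row and attaches the per-primaryid success count.
  def withSuccessCount(df: DataFrame): DataFrame =
    df.withColumn("success_count", successCount.over(Window.partitionBy(col("primaryid"))))

  // Aggregation variant: one output row per primaryid.
  def successCountPerId(df: DataFrame): DataFrame =
    df.groupBy("primaryid").agg(successCount.as("success_count"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .appName("SuccessCountSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val someDF = Seq(
      (4623874, "user1", "success"),
      (4623874, "user2", "fail"),
      (4623874, "user3", "success"),
      (1343244, "user4", "fail"),
      (4235252, "user5", "fail")
    ).toDF("primaryid", "user", "status")

    withSuccessCount(someDF).show(false)
    successCountPerId(someDF).show(false)

    spark.stop()
  }
}
```

With the sample data, primaryid 4623874 gets a success_count of 2 and the other two ids get 0. Unlike the filter approach, ids with no successful rows still appear in the result.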

edited Oct 28, 2020 at 8:35

answered Oct 28, 2020 at 7:52


1 Comment

Tulasi Over a year ago

Thanks for your response. One more clarification: is it possible to get only the success count for a specific primaryid? I mean something like this: someDF.groupBy("primaryid").agg(count(col("status") === "success")).show()
