Scala Spark - Count occurrences of a specific string in Dataframe column

Question

How can I count the occurrences of a String in a df Column using Spark partitioned by id?

e.g. Find the value "test" in column "name" of a df

In SQL would be:

 SELECT
    SUM(CASE WHEN name = 'test' THEN 1 else 0 END) over window AS cnt_test
  FROM
    mytable
 WINDOW window AS (PARTITION BY id)

I've tried using map( v => match { case "test" -> 1.. })

and things like:

def getCount(df: DataFrame): DataFrame = {
    val dfCnt = df.agg(
          .withColumn("cnt_test", 
            count(col("name")==lit('test'))
)

Is this an expensive operation? What could be the best approach to check for occurrences of a specific string and then perform an action (sum, max, min, etc)?

thanks

does any answer helped you? If yes, please accept it

Raphael Roth
– Raphael Roth

2017-11-01 06:54:12 +00:00
Commented Nov 1, 2017 at 6:54 — Raphael Roth
– Raphael Roth, Commented Nov 1, 2017 at 6:54

akuiper · Accepted Answer · 2017-10-29 00:28:07Z

8

You can use groupBy + agg in spark; Here when($"name" == "test", 1) transforms name column to 1 if name == 'test', null otherwise, and count gives count of non null values:

df.groupBy("id").agg(count(when($"name" === "test", 1)).as("cnt_test"))

Example:

val df = Seq(("a", "joe"), ("b", "test"), ("b", "john")).toDF("id", "name")
df.groupBy("id").agg(count(when($"name" === "test", 1)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
|  b|       1|
|  a|       0|
+---+--------+

Or similar to your sql queries:

df.groupBy("id").agg(sum(when($"name" === "test", 1).otherwise(0)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
|  b|       1|
|  a|       0|
+---+--------+

edited Oct 29, 2017 at 0:28

answered Oct 29, 2017 at 0:22

akuiper

216k33 gold badges363 silver badges380 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Raphael Roth · Accepted Answer · 2017-10-29 10:22:59Z

0

If you want to translate your SQL, you can just also Window-functions in Spark as well:

def getCount(df: DataFrame): DataFrame = {
  import org.apache.spark.sql.expressions.Window

  df.withColumn("cnt_test",
      sum(when($"name" === "test", 1).otherwise(0)).over(Window.partitionBy($"id"))
    )
}

answered Oct 29, 2017 at 10:22

Raphael Roth

27.3k19 gold badges98 silver badges152 bronze badges

Collectives™ on Stack Overflow

Scala Spark - Count occurrences of a specific string in Dataframe column

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related