5

How can I count the occurrences of a String in a df Column using Spark partitioned by id?

e.g. Find the value "test" in column "name" of a df

In SQL would be:

 SELECT
    SUM(CASE WHEN name = 'test' THEN 1 else 0 END) over window AS cnt_test
  FROM
    mytable
 WINDOW window AS (PARTITION BY id)

I've tried using map( v => match { case "test" -> 1.. })

and things like:

def getCount(df: DataFrame): DataFrame = {
    val dfCnt = df.agg(
          .withColumn("cnt_test", 
            count(col("name")==lit('test'))
)

Is this an expensive operation? What could be the best approach to check for occurrences of a specific string and then perform an action (sum, max, min, etc)?

thanks

1
  • does any answer helped you? If yes, please accept it Commented Nov 1, 2017 at 6:54

2 Answers 2

8

You can use groupBy + agg in spark; Here when($"name" == "test", 1) transforms name column to 1 if name == 'test', null otherwise, and count gives count of non null values:

df.groupBy("id").agg(count(when($"name" === "test", 1)).as("cnt_test"))

Example:

val df = Seq(("a", "joe"), ("b", "test"), ("b", "john")).toDF("id", "name")
df.groupBy("id").agg(count(when($"name" === "test", 1)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
|  b|       1|
|  a|       0|
+---+--------+

Or similar to your sql queries:

df.groupBy("id").agg(sum(when($"name" === "test", 1).otherwise(0)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
|  b|       1|
|  a|       0|
+---+--------+
Sign up to request clarification or add additional context in comments.

Comments

0

If you want to translate your SQL, you can just also Window-functions in Spark as well:

def getCount(df: DataFrame): DataFrame = {
  import org.apache.spark.sql.expressions.Window

  df.withColumn("cnt_test",
      sum(when($"name" === "test", 1).otherwise(0)).over(Window.partitionBy($"id"))
    )
}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.