
In Cassandra I have a column of list type. I am new to Spark and Scala, and have no idea where to start. In Spark I want to get the count of each value; is it possible to do so? Below is the dataframe:

+--------------------+------------+
|                  id|        data|
+--------------------+------------+
|53e5c3b0-8c83-11e...|      [b, c]|
|508c1160-8c83-11e...|      [a, b]|
|4d16c0c0-8c83-11e...|   [a, b, c]|
|5774dde0-8c83-11e...|[a, b, c, d]|
+--------------------+------------+

I want the output as:

+--------------------+------------+
|   value            |      count |
+--------------------+------------+
|a                   |      3     |
|b                   |      4     |
|c                   |      3     |
|d                   |      1     |
+--------------------+------------+

Spark version: 1.4


2 Answers


Here you go:

scala> val rdd = sc.parallelize(
  Seq(
    ("53e5c3b0-8c83-11e", Array("b", "c")),
    ("53e5c3b0-8c83-11e1", Array("a", "b")),
    ("53e5c3b0-8c83-11e2", Array("a", "b", "c")),
    ("53e5c3b0-8c83-11e3", Array("a", "b", "c", "d"))))
// rdd: org.apache.spark.rdd.RDD[(String, Array[String])] = ParallelCollectionRDD[22] at parallelize at <console>:27

scala> rdd.flatMap(_._2).map((_, 1)).reduceByKey(_ + _)
// res11: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[21] at reduceByKey at <console>:30

scala> rdd.flatMap(_._2).map((_,1)).reduceByKey(_ + _).collect
// res16: Array[(String, Int)] = Array((a,3), (b,4), (c,3), (d,1))

This is also actually quite easy with the DataFrame API:

scala> val df = rdd.toDF("id", "data")
// df: org.apache.spark.sql.DataFrame = [id: string, data: array<string>]

scala> df.select(explode($"data").as("value")).groupBy("value").count.show
// +-----+-----+
// |value|count|
// +-----+-----+
// |    d|    1|
// |    c|    3|
// |    b|    4|
// |    a|    3|
// +-----+-----+
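
If the data is still sitting in Cassandra rather than in a hand-built RDD, you can get the same DataFrame through the spark-cassandra-connector and then run the identical aggregation. This is only a sketch: the keyspace/table names below are placeholders, the connector must be on the classpath, and the exact options can vary with the connector version that matches Spark 1.4.

import org.apache.spark.sql.functions.{col, explode}

// Placeholder keyspace/table names; assumes spark.cassandra.connection.host
// is configured to point at the Cassandra cluster.
val cassandraDf = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "mykeyspace", "table" -> "mytable"))
  .load()

// Same aggregation as above: one row per list element, then a count per element.
cassandraDf.select(explode(col("data")).as("value")).groupBy("value").count.show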

1 Comment

Can you provide a PySpark implementation of the solution?

You need something like this (from the Apache Spark examples):

val textFile = sc.textFile("hdfs://...")
val counts = textFile
             .flatMap(line => line.split(" "))
             .map(word => (word, 1))
             .reduceByKey(_ + _)

Assuming you already have (key, value) pairs, .reduceByKey(_ + _) will return what you need.

You can also try something like this in the Spark shell:

sc.parallelize(Array[Integer](1,1,1,2,2),3).map(x=>(x,1)).reduceByKey(_+_).foreach(println)
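
Applied to the DataFrame from the question (here assumed to be named df, with the array column at index 1 and holding strings), the same pattern would look roughly like this; it's a sketch, not a tested Spark 1.4 snippet:

// Pull the array column out of each Row, flatten it, then do the
// classic (value, 1) map followed by reduceByKey.
val counts = df.rdd
  .flatMap(row => row.getSeq[String](1))
  .map(value => (value, 1))
  .reduceByKey(_ + _)

counts.collect()  // e.g. Array((a,3), (b,4), (c,3), (d,1)) -- order may vary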
