
I have a Spark DataFrame in Scala called df with two columns, a and b. Column a contains letters and column b contains numbers, as shown below.

   a   b
----------
   g   0
   f   0
   g   0
   f   1

I can get the distinct rows using

val dfDistinct = df.select("a", "b").distinct

which gives the following:

   a   b
----------
   g   0
   f   0
   f   1

I want to add another column with the number of times each distinct combination occurs in the first DataFrame, so I'd end up with

   a   b   count
------------------
   g   0   2
   f   0   1
   f   1   1

I don't mind whether this modifies the original command or is a separate operation on dfDistinct that produces another DataFrame.

Any advice is greatly appreciated. I apologise for the trivial nature of this question, but I'm not very experienced with this kind of operation in Scala or Spark.

Thanks

Dean

2 Answers


You can simply aggregate and count:

df.groupBy($"a", $"b").count

or a little bit more verbose:

import org.apache.spark.sql.functions.{count, lit}

df.groupBy($"a", $"b").agg(count(lit(1)).alias("cnt"))

Both are equivalent to a raw SQL aggregation:

df.registerTempTable("df")

sqlContext.sql("SELECT a, b, COUNT(1) AS cnt FROM df GROUP BY a, b")
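
For readers on Spark 2.0 or later, the same aggregation can be sketched with a SparkSession, where createOrReplaceTempView takes the place of the deprecated registerTempTable. This is a minimal sketch, not part of the original answer: the object name DistinctCounts, the helper withCounts, the app name, and the local master setting are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, lit}

object DistinctCounts {
  // Hypothetical helper: one row per distinct (a, b) pair plus its frequency in `cnt`.
  def withCounts(df: DataFrame): DataFrame =
    df.groupBy("a", "b").agg(count(lit(1)).alias("cnt"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("distinct-counts")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // provides toDF on local collections

    // Same sample data as in the question
    val df = Seq(("g", 0), ("f", 0), ("g", 0), ("f", 1)).toDF("a", "b")

    // DataFrame API version
    withCounts(df).show()

    // SQL version: register the DataFrame as a temporary view, then aggregate
    df.createOrReplaceTempView("df")
    spark.sql("SELECT a, b, COUNT(1) AS cnt FROM df GROUP BY a, b").show()

    spark.stop()
  }
}
```

Row order in the output is not guaranteed after a groupBy, so add an orderBy if a stable order matters.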

3 Comments

It's always simple when you know it, but I don't find it trivial to get the information. Am I missing a resource? Thank you, by the way. Exactly what I wanted.
I don't know :) Maybe Spark SQL and DataFrame Guide?
Thanks again. They have changed the behaviour of that in 5.1! A lesson to read the changes. Accepted the answer and tried to upvote but don't have the rep!

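Also see cross tabulation (df.stat.crosstab), which returns the same counts in wide form, with one column per distinct value of b: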

// toDF needs the SQL implicits, which spark-shell imports automatically
val g = "g"
val f = "f"
val df = Seq(
  (g, "0"),
  (f, "0"),
  (g, "0"),
  (f, "1")
).toDF("a", "b")
val res = df.stat.crosstab("a", "b")
res.show

+---+---+---+
|a_b|  0|  1|
+---+---+---+
|  g|  2|  0|
|  f|  1|  1|
+---+---+---+
