Count instances of combination of columns in spark dataframe using scala

Question

I have a Spark data frame in Scala called df with two columns, say a and b. Column a contains letters and column b contains numbers, as shown below.

   a   b
----------
   g   0
   f   0
   g   0
   f   1

I can get the distinct rows using

val dfDistinct=df.select("a","b").distinct

which gives the following:

   a  b
----------
   g   0
   f   0
   f   1

I want to add another column with the number of times each of these distinct combinations occurs in the first dataframe, so I'd end up with

   a   b   count
------------------
   g   0   2
   f   0   1
   f   1   1

I don't mind whether this modifies the original command or is a separate operation on dfDistinct that gives another data frame.

Any advice greatly appreciated and I apologise for the trivial nature of this question but I'm not the most experienced with this kind of operation in scala or spark.

Thanks

Dean

zero323 · Accepted Answer · 2015-10-28 15:32:54Z

13

You can simply aggregate and count:

df.groupBy($"a", $"b").count

or, a little more verbosely:

import org.apache.spark.sql.functions.{count, lit}

df.groupBy($"a", $"b").agg(count(lit(1)).alias("cnt"))
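For anyone who wants to try this outside the shell, here is a minimal self-contained sketch of the same aggregation. It assumes Spark 2.x or later (it uses SparkSession rather than the sqlContext shown in this answer) and a local master. The object name ComboCount and the helper countCombos are hypothetical names made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ComboCount {
  // Hypothetical helper: one row per distinct (a, b) pair,
  // plus a column named "count" holding the number of occurrences.
  def countCombos(df: DataFrame): DataFrame =
    df.groupBy("a", "b").count()

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.x+ session is acceptable for the demo.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("combo-count")
      .getOrCreate()
    import spark.implicits._ // needed for toDF on a local Seq

    // The sample data from the question.
    val df = Seq(("g", 0), ("f", 0), ("g", 0), ("f", 1)).toDF("a", "b")

    // Row order after a groupBy is not guaranteed; add orderBy if it matters.
    countCombos(df).show()

    spark.stop()
  }
}
```

The result keeps the grouping columns, so there is no need to compute the distinct rows first and join the counts back.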

Both are equivalent to a raw SQL aggregation:

df.registerTempTable("df")

sqlContext.sql("SELECT a, b, COUNT(1) AS cnt FROM df GROUP BY a, b")
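On Spark 2.0 and later, registerTempTable is deprecated in favour of createOrReplaceTempView, and SparkSession usually takes the place of sqlContext. A sketch of the same SQL aggregation under those assumptions follows. The object name ComboCountSql and the helper countCombosSql are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ComboCountSql {
  // Hypothetical helper: registers df as a temporary view named "df"
  // and runs the same GROUP BY query as the answer above.
  def countCombosSql(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("df")
    spark.sql("SELECT a, b, COUNT(1) AS cnt FROM df GROUP BY a, b")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.x+ running with a local master.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("combo-count-sql")
      .getOrCreate()
    import spark.implicits._ // needed for toDF on a local Seq

    val df = Seq(("g", 0), ("f", 0), ("g", 0), ("f", 1)).toDF("a", "b")
    countCombosSql(spark, df).show()

    spark.stop()
  }
}
```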

edited Oct 28, 2015 at 15:32

answered Oct 28, 2015 at 15:06


3 Comments

Dean Over a year ago

It's always simple when you know it, but I don't find it trivial to get the information. Am I missing a resource? Thank you, by the way. Exactly what I wanted.

zero323 Over a year ago

I don't know :) Maybe Spark SQL and DataFrame Guide?

Dean Over a year ago

Thanks again. They have changed the behaviour of that in 5.1! A lesson to read the changes. Accepted the answer and tried to upvote but don't have the rep!

oluies · Answer · 2016-11-29 00:33:23Z

4

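Also see cross tabulation (df.stat.crosstab), which returns the counts in wide form, with one row per value of a and one column per value of b: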

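// In spark-shell the implicits needed for toDF are already in scope;
// elsewhere, import spark.implicits._ (sqlContext.implicits._ on Spark 1.x).
val g = "g"
val f = "f"
val df = Seq(
  (g, "0"),
  (f, "0"),
  (g, "0"),
  (f, "1")
).toDF("a", "b")

// Pairwise frequency table of columns a and b.
val res = df.stat.crosstab("a", "b")
res.show()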

+---+---+---+
|a_b|  0|  1|
+---+---+---+
|  g|  2|  0|
|  f|  1|  1|
+---+---+---+

answered Nov 29, 2016 at 0:33
