How to GROUPING SETS as operator/method on Dataset?

Question

Is there no function level grouping_sets support in spark scala?

I have no idea this patch applied to master https://github.com/apache/spark/pull/5080

I want to do this kind of query by scala dataframe api.

GROUP BY expression list GROUPING SETS(expression list2)

cube and rollup functions are available in Dataset API, but can't find grouping sets. Why?

Jacek Laskowski · Accepted Answer · 2021-12-20 16:03:12Z

9

I want to do this kind of query by scala dataframe api.

tl;dr Up to Spark 2.1.0 it is not possible. There are currently no plans to add such an operator to Dataset API.

Spark SQL supports the following so-called multi-dimensional aggregate operators:

rollup operator
cube operator
GROUPING SETS clause (only in SQL mode)
grouping() and grouping_id() functions

NOTE: GROUPING SETS is only available in SQL mode. There is no support in Dataset API.

GROUPING SETS

val sales = Seq(
  ("Warsaw", 2016, 100),
  ("Warsaw", 2017, 200),
  ("Boston", 2015, 50),
  ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
).toDF("city", "year", "amount")
sales.createOrReplaceTempView("sales")

// equivalent to rollup("city", "year")
val q = sql("""
  SELECT city, year, sum(amount) as amount
  FROM sales
  GROUP BY city, year
  GROUPING SETS ((city, year), (city), ())
  ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
  """)

scala> q.show
+-------+----+------+
|   city|year|amount|
+-------+----+------+
| Warsaw|2016|   100|
| Warsaw|2017|   200|
| Warsaw|null|   300|
|Toronto|2017|    50|
|Toronto|null|    50|
| Boston|2015|    50|
| Boston|2016|   150|
| Boston|null|   200|
|   null|null|   550|  <-- grand total across all cities and years
+-------+----+------+

// equivalent to cube("city", "year")
// note the additional (year) grouping set
val q = sql("""
  SELECT city, year, sum(amount) as amount
  FROM sales
  GROUP BY city, year
  GROUPING SETS ((city, year), (city), (year), ())
  ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
  """)

scala> q.show
+-------+----+------+
|   city|year|amount|
+-------+----+------+
| Warsaw|2016|   100|
| Warsaw|2017|   200|
| Warsaw|null|   300|
|Toronto|2017|    50|
|Toronto|null|    50|
| Boston|2015|    50|
| Boston|2016|   150|
| Boston|null|   200|
|   null|2015|    50|  <-- total across all cities in 2015
|   null|2016|   250|  <-- total across all cities in 2016
|   null|2017|   250|  <-- total across all cities in 2017
|   null|null|   550|
+-------+----+------+

If a value in a column of resulting table is null, it may not necessarily mean that the column was aggregated on that row. If that column has nulls in the original table, null value in the aggregations table may represent just a null value from the original table. Use grouping function to check if the column was aggregated on the specific row or not.

edited Dec 20, 2021 at 16:03

answered Apr 25, 2017 at 9:41

Jacek Laskowski

75k28 gold badges253 silver badges440 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Pushpendra Jaiswal Over a year ago

how to query reports like I want to fectch total across city 2015. Do I need to put all other column as null. e.g select amount from table where year=2015 and city is null and every other column is null

Jacek Laskowski Over a year ago

@PushpendraJaiswal That'd be my suggestion exactly.

ZygD Over a year ago

If a value in a column of resulting table is null, it may not necessarily mean that the column was aggregated on that row. If that column has nulls in the original table, null value in the aggregations table may represent just a null value from the original table. You can check if the column was aggregated on the specific row by using grouping function.

user6022341 · Accepted Answer · 2016-12-02 02:25:17Z

0

Spark supports GROUPING SETS. You can find corresponding tests here:

https://github.com/apache/spark/blob/5b7d403c1819c32a6a5b87d470f8de1a8ad7a987/sql/core/src/test/resources/sql-tests/inputs/group-analytics.sql#L25-L28

answered Dec 2, 2016 at 2:25

community wiki

user6022341

2 Comments

Jihun No Over a year ago

but how about dataframe function api. I can not find it.

user6022341 Over a year ago

There isn't. It is available only in SQL.

Collectives™ on Stack Overflow

How to GROUPING SETS as operator/method on Dataset?

2 Answers 2

GROUPING SETS

3 Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

GROUPING SETS

3 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related