How to group by rollup on only some columns in Apache Spark SQL?

Question

I'm using the the SQL API for Spark 3.0 in a Databricks 7.0 runtime cluster. I know that I can do the following:

select
  coalesce(a, "All A") as colA,
  coalesce(b, "All B") as colB,
  sum(c) as sumC
from
  myTable
group by rollup (
  colA,
  colB
)
order by
  colA asc,
  colB asc

I'd then expect an output like:

+-------+-------+------+
| colA  | colB  | sumC |
+-------+-------+------+
| All A | All B |  300 |
| a1    | All B |  100 |
| a1    | b1    |  30  |
| a1    | b2    |  70  |
| a2    | All B |  200 |
| a2    | b1    |  50  |
| a2    | b2    |  150 |
+-------+-------+------+

However, I'm trying to write a query where only column b needs to be rolled up. I've written something like:

select
  a as colA,
  coalesce(b, "All B") as colB,
  sum(c) as sumC
from
  myTable
group by 
  a,
  rollup (b)
order by
  colA asc,
  colB asc

And I'd expect an output like:

+-------+-------+------+
| colA  | colB  | sumC |
+-------+-------+------+
| a1    | All B |  100 |
| a1    | b1    |  30  |
| a1    | b2    |  70  |
| a2    | All B |  200 |
| a2    | b1    |  50  |
| a2    | b2    |  150 |
+-------+-------+------+

I know this sort of operation is supported in at least some SQL APIs, but I get Error in SQL statement: UnsupportedOperationException when trying to run the above query. Does anyone know whether this behavior is simply as-of-yet unsupported in Spark 3.0 or if I just have the syntax wrong? The docs aren't helpful on the subject.

I know that I can accomplish this with union all, but I'd prefer to avoid that route, if only for the sake of elegance and brevity.

Thanks in advance, and please let me know if I can clarify anything.

You may be able to do it with GROUPING SETS stackoverflow.com/questions/40923680/…. Available in SQL only. — mazaneicha
– mazaneicha, Commented Jul 1, 2020 at 22:36

wBob · Accepted Answer · 2020-07-02 22:06:13Z

4

Try this GROUPING SETS option:

%sql
SELECT 
  COALESCE( a, 'all a' ) a, 
  COALESCE( b, 'all b' ) b, 
  SUM(c) c
FROM myTable
GROUP BY a, b
GROUPING SETS ( ( a , b ), a )
ORDER BY a, b

My results (with updated numbers):

edited Jul 2, 2020 at 22:06

answered Jul 1, 2020 at 22:57

wBob

14.4k3 gold badges26 silver badges43 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Keith Over a year ago

I believe GROUPING SETS is indeed what I need here. Thank you very much. The syntax you've provided doesn't quite achieve the desired output (it looks like you're doing something semantically equivalent to GROUP BY ROLLUP). I'll experiment later this week when I have time again, and once I have the right syntax, I'll follow up. From there, if you update your answer to match, I'll mark it as accepted.

wBob Over a year ago

Thanks Keith - I was guessing a bit as your numbers don’t quite add up but hopefully you get the idea.

Keith Over a year ago

Sorry about that—the numbers were just placeholders. I'll update the question to make them clearer.

wBob Over a year ago

How are you getting on? GROUPING SETS is standard ANSI SQL so you should be able to read about it and how it works. The way I think of it is, grouping sets can add extra summary rows to your result and you control what those rows are. Two brackets () is shorthand for "summarise everything"; there's some other notes here.

Keith Over a year ago

I've had the opportunity to use GROUPING SETS in a few other contexts now, and it's beginning to make a lot more sense with practice. Thank you again for pointing me in the right direction!

|

Collectives™ on Stack Overflow

How to group by rollup on only some columns in Apache Spark SQL?

1 Answer 1

6 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

6 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related