4

I'm using the the SQL API for Spark 3.0 in a Databricks 7.0 runtime cluster. I know that I can do the following:

select
  coalesce(a, "All A") as colA,
  coalesce(b, "All B") as colB,
  sum(c) as sumC
from
  myTable
group by rollup (
  colA,
  colB
)
order by
  colA asc,
  colB asc

I'd then expect an output like:

+-------+-------+------+
| colA  | colB  | sumC |
+-------+-------+------+
| All A | All B |  300 |
| a1    | All B |  100 |
| a1    | b1    |  30  |
| a1    | b2    |  70  |
| a2    | All B |  200 |
| a2    | b1    |  50  |
| a2    | b2    |  150 |
+-------+-------+------+

However, I'm trying to write a query where only column b needs to be rolled up. I've written something like:

select
  a as colA,
  coalesce(b, "All B") as colB,
  sum(c) as sumC
from
  myTable
group by 
  a,
  rollup (b)
order by
  colA asc,
  colB asc

And I'd expect an output like:

+-------+-------+------+
| colA  | colB  | sumC |
+-------+-------+------+
| a1    | All B |  100 |
| a1    | b1    |  30  |
| a1    | b2    |  70  |
| a2    | All B |  200 |
| a2    | b1    |  50  |
| a2    | b2    |  150 |
+-------+-------+------+

I know this sort of operation is supported in at least some SQL APIs, but I get Error in SQL statement: UnsupportedOperationException when trying to run the above query. Does anyone know whether this behavior is simply as-of-yet unsupported in Spark 3.0 or if I just have the syntax wrong? The docs aren't helpful on the subject.

I know that I can accomplish this with union all, but I'd prefer to avoid that route, if only for the sake of elegance and brevity.

Thanks in advance, and please let me know if I can clarify anything.

1

1 Answer 1

4

Try this GROUPING SETS option:

%sql
SELECT 
  COALESCE( a, 'all a' ) a, 
  COALESCE( b, 'all b' ) b, 
  SUM(c) c
FROM myTable
GROUP BY a, b
GROUPING SETS ( ( a , b ), a )
ORDER BY a, b

My results (with updated numbers):

My results

Sign up to request clarification or add additional context in comments.

6 Comments

I believe GROUPING SETS is indeed what I need here. Thank you very much. The syntax you've provided doesn't quite achieve the desired output (it looks like you're doing something semantically equivalent to GROUP BY ROLLUP). I'll experiment later this week when I have time again, and once I have the right syntax, I'll follow up. From there, if you update your answer to match, I'll mark it as accepted.
Thanks Keith - I was guessing a bit as your numbers don’t quite add up but hopefully you get the idea.
Sorry about that—the numbers were just placeholders. I'll update the question to make them clearer.
How are you getting on? GROUPING SETS is standard ANSI SQL so you should be able to read about it and how it works. The way I think of it is, grouping sets can add extra summary rows to your result and you control what those rows are. Two brackets () is shorthand for "summarise everything"; there's some other notes here.
I've had the opportunity to use GROUPING SETS in a few other contexts now, and it's beginning to make a lot more sense with practice. Thank you again for pointing me in the right direction!
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.