
I want to concatenate arrays across rows and then do a distinct count. Ideally, this would work:

WITH test AS
(
  SELECT
  DATE('2018-01-01') as date,
  2 as value,
  [1,2,3] as key
  UNION ALL
  SELECT
  DATE('2018-01-02') as date,
  3 as value,
  [1,4,5] as key
)
SELECT
  SUM(value) as total_value,
  ARRAY_LENGTH(ARRAY_CONCAT_AGG(DISTINCT key)) as unique_key_count
FROM test

Unfortunately, the ARRAY_CONCAT_AGG function doesn't support the DISTINCT modifier. I can unnest the array, but then I get a fanout: each row is repeated once per array element, so the sum of the value column is wrong (15 instead of 5 for this sample data):

WITH test AS
(
  SELECT
  DATE('2018-01-01') as date,
  2 as value,
  [1,2,3] as key
  UNION ALL
  SELECT
  DATE('2018-01-02') as date,
  3 as value,
  [1,4,5] as key
)

SELECT
  SUM(value) as total_value,
  COUNT(DISTINCT k) as unique_key_count

FROM test
  CROSS JOIN UNNEST(key) k


Is there anything I'm missing that would allow me to avoid joining in the unnested array?

2 Answers


Here is an alternative:

CREATE TEMP FUNCTION DistinctCount(arr ANY TYPE) AS (
  (SELECT COUNT(DISTINCT x) FROM UNNEST(arr) AS x)
);

WITH test AS
(
  SELECT
  DATE('2018-01-01') as date,
  2 as value,
  [1,2,3] as key
  UNION ALL
  SELECT
  DATE('2018-01-02') as date,
  3 as value,
  [1,4,5] as key
)

SELECT
  SUM(value) as total_value,
  DistinctCount(ARRAY_CONCAT_AGG(key)) as unique_key_count
FROM test

This avoids both wrapping the query in a subquery and joining the unnested array back to the table, which is what duplicates rows and inflates the sum.
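
The function body can also be written inline as a scalar subquery over the aggregated array, which may be easier to emit from a query generator that cannot declare temp functions. This is only a sketch: it assumes BigQuery accepts the aggregate call inside the scalar subquery the same way it does when passed to the function, and it has not been benchmarked. The alias x is just illustrative:

```sql
WITH test AS
(
  SELECT DATE('2018-01-01') AS date, 2 AS value, [1,2,3] AS key
  UNION ALL
  SELECT DATE('2018-01-02') AS date, 3 AS value, [1,4,5] AS key
)
SELECT
  SUM(value) AS total_value,
  -- Same logic as DistinctCount, inlined: unnest the concatenated array
  -- and count its distinct elements without joining back to test.
  (SELECT COUNT(DISTINCT x) FROM UNNEST(ARRAY_CONCAT_AGG(key)) AS x) AS unique_key_count
FROM test
```

Because the unnesting happens on the already aggregated array, SUM(value) is still computed over the original rows.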


1 Comment

This is great, thanks. I want to use this in a query generation tool (Looker), so the subquery approach doesn't actually work for the use case. Interestingly, when I try it with a bunch of data, it's a lot slower. Didn't realise UDFs slow things down so much. The HLL++ approach below is much faster so that may be the way to go.

Below is for BigQuery Standard SQL

#standardSQL
WITH test AS
(
  SELECT DATE('2018-01-01') AS DATE, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS DATE, 3 AS value, [1,4,5] AS key
)
SELECT 
  total_value,
  COUNT(DISTINCT key) unique_key_count
FROM (
  SELECT
    SUM(value) AS total_value,
    ARRAY_CONCAT_AGG(key) AS all_keys
  FROM test
), UNNEST(all_keys) key
GROUP BY total_value  

Result:

Row total_value unique_key_count     
1   5           5     

If your table has a large number of rows, this can easily run into memory/resource issues. In that case, you can try the HyperLogLog++ functions for approximate aggregation; see the example below:

#standardSQL
WITH test AS
(
  SELECT DATE('2018-01-01') AS DATE, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS DATE, 3 AS value, [1,4,5] AS key
)
SELECT
  SUM(value) total_value,
  HLL_COUNT.MERGE((SELECT HLL_COUNT.INIT(key) FROM UNNEST(key) key)) AS unique_key_count
FROM test

Result:

Row total_value unique_key_count     
1   5           5

Note: this is an approximate aggregation, so pay attention to the precision parameter of the HLL_COUNT.INIT(input [, precision]) function.
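
As a rough sketch of how the precision argument is passed: the optional second argument to HLL_COUNT.INIT sets the sketch precision (accepted range 10 to 24, default 15), with higher values giving tighter estimates at the cost of larger sketches. The value 24 and the alias k below are only illustrative choices:

```sql
#standardSQL
WITH test AS
(
  SELECT DATE('2018-01-01') AS date, 2 AS value, [1,2,3] AS key UNION ALL
  SELECT DATE('2018-01-02') AS date, 3 AS value, [1,4,5] AS key
)
SELECT
  SUM(value) AS total_value,
  -- Build one sketch per row at the highest precision, then merge the sketches.
  HLL_COUNT.MERGE((SELECT HLL_COUNT.INIT(k, 24) FROM UNNEST(key) AS k)) AS unique_key_count
FROM test
```

The result is still an estimate, so this does not replace an exact COUNT(DISTINCT ...) where exactness is required.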

10 Comments

Thanks. Hadn't played around with the HLL++ functions before
That was the main purpose of the answer - to introduce those functions. They are usually overlooked :o)
Thanks, this is a great example of the usage of HLL++ functions!! I find the official documentation greatly lacking with clear examples (cloud.google.com/bigquery/docs/reference/standard-sql/…)
@RogierWerschkull - Take a look at this Blog by one of my most favorite Googlers :o)
While the HLL approach works, my client wants exact distinct counting. For that, the non-HLL version works of course but is not something that can be generated dynamically with Looker, the BI tool we are using. Is there another way to achieve this?