Combining COUNT DISTINCT with FILTER - Spark SQL

Question

Given the two tables below, for each datapoint, I want to count the number of distinct years for which we have a value. However, Spark SQL does not allow combining COUNT DISTINCT and FILTER.

CREATE TABLE datapoints (name STRING);

INSERT INTO
  datapoints
VALUES
  ('Name'),
  ('Height'),
  ('Color');

CREATE TABLE entities (datapoint STRING, year INT, value STRING);

INSERT INTO
  entities
VALUES
  ('Name', 2015, 'John'),
  ('Name', 2015, 'Suzan'),
  ('Name', 2017, 'Jim'),
  ('Color', 2015, 'Blue');

SELECT
  dp.name,
  COUNT(DISTINCT year) FILTER (
    WHERE
      value IS NOT NULL
  ) as DPCount  
FROM
  datapoints as dp
  LEFT JOIN entities on datapoint = dp.name
GROUP BY
  dp.name

Results in:

Error in SQL statement: AnalysisException: DISTINCT and FILTER cannot be used in aggregate functions at the same time; line 3 pos 2

What would be the functionally equivalent valid Spark SQL statement? The expected output is (notice the duplicate year for 'Name'):

name	DPCount
Color	1
Height	0
Name	2

mck · Accepted Answer · 2021-04-06 14:01:55Z

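Try doing a COUNT(DISTINCT ...) over a CASE WHEN expression. A CASE with no ELSE branch returns NULL for rows that fail the condition, and COUNT ignores NULLs, so only years that have a non-null value are counted: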

SELECT
  dp.name,
  COUNT(DISTINCT CASE WHEN value IS NOT NULL THEN year END) AS DPCount
FROM
  datapoints as dp
  LEFT JOIN entities on datapoint = dp.name
GROUP BY
  dp.name
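
For comparison, here is a sketch of another way to get the same result. It is not part of the original answer: it moves the filter into a subquery so that rows with a NULL value are dropped before the join. The alias e is introduced here for illustration only. Because the join is still a LEFT JOIN, a datapoint with no surviving rows (such as 'Height') gets a NULL year, which COUNT(DISTINCT ...) counts as 0.

```sql
SELECT
  dp.name,
  -- COUNT ignores NULLs, so unmatched datapoints yield 0
  COUNT(DISTINCT e.year) AS DPCount
FROM
  datapoints AS dp
  LEFT JOIN (
    -- apply the filter before joining instead of inside the aggregate
    SELECT datapoint, year
    FROM entities
    WHERE value IS NOT NULL
  ) AS e ON e.datapoint = dp.name
GROUP BY
  dp.name;
```

The CASE WHEN form above is more compact when only one aggregate needs the filter. The subquery form may read more clearly when every aggregate in the query should see the same filtered rows.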
