Given the two tables below, I want to count, for each datapoint, the number of distinct years for which we have a value. However, Spark SQL does not allow COUNT(DISTINCT ...) to be combined with a FILTER clause.

CREATE TABLE datapoints (name STRING);

INSERT INTO
  datapoints
VALUES
  ('Name'),
  ('Height'),
  ('Color');

CREATE TABLE entities (datapoint STRING, year INT, value STRING);

INSERT INTO
  entities
VALUES
  ('Name', 2015, 'John'),
  ('Name', 2015, 'Suzan'),
  ('Name', 2017, 'Jim'),
  ('Color', 2015, 'Blue');

SELECT
  dp.name,
  COUNT(DISTINCT year) FILTER (
    WHERE
      value IS NOT NULL
  ) as DPCount  
FROM
  datapoints as dp
  LEFT JOIN entities on datapoint = dp.name
GROUP BY
  dp.name

Results in:

Error in SQL statement: AnalysisException: DISTINCT and FILTER cannot be used in aggregate functions at the same time; line 3 pos 2

What would be the functionally equivalent valid Spark SQL statement? The expected output is (notice the duplicate year for 'Name'):

name DPCount
Color 1
Height 0
Name 2

1 Answer

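Move the filter condition into a CASE expression inside COUNT(DISTINCT ...). A CASE with no ELSE branch returns NULL when the condition is not met, and COUNT ignores NULLs, so only years that have a non-null value are counted:

SELECT
  dp.name,
  COUNT(DISTINCT CASE WHEN value IS NOT NULL THEN year END) AS DPCount
FROM
  datapoints AS dp
  LEFT JOIN entities ON datapoint = dp.name
GROUP BY
  dp.name;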
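
As an alternative sketch (not part of the original answer), the same filter can be moved into the join condition instead of the aggregate. Rows whose value is NULL then never match, so the LEFT JOIN leaves year as NULL for them and COUNT(DISTINCT ...) skips it. The alias e and the ORDER BY are additions for readability; the table and column names are the ones from the question.

```sql
SELECT
  dp.name,
  -- year is NULL for datapoints with no matching non-null value,
  -- and COUNT ignores NULLs, so those datapoints get 0
  COUNT(DISTINCT e.year) AS DPCount
FROM
  datapoints AS dp
  LEFT JOIN entities AS e
    ON e.datapoint = dp.name
    AND e.value IS NOT NULL  -- filter applied before aggregation
GROUP BY
  dp.name
ORDER BY
  dp.name;
```

With the sample data from the question this should give Color 1, Height 0, Name 2. The predicate has to stay in the ON clause: putting it in a WHERE clause would drop the 'Height' row entirely rather than reporting 0 for it.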