BigQuery - Remove duplicates from array

Question

Using BigQuery, I want to group pages depending on their title with one query and calculate different metrics on the groups. As the rules on titles are not mutually exclusive, I've done it this way:

SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
CROSS JOIN
UNNEST([
    CASE WHEN (title LIKE '%game%') 
    THEN 'games_group' END, 
    CASE WHEN (title LIKE '%sport%') 
    THEN 'sports_group' END
]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10'AND wiki='en'
GROUP BY title_group

Here is the result:

views       ...   title_group
3414469869  ... 
4355264     ...   games_group
1361074     ...   sports_group

However, the number 3414469869 for the views of the pages that don't belong to any group is wrong. Indeed, when a title doesn't contain "game" (or "sport"), we get UNNEST([null, "sports_group"]) (or UNNEST(["games_group", null])) so we still count the views for the null group. When a title doesn't contain "game" neither "sport", the views are even counted twice.

Is there a way to remove duplicates from the array ?

Gordon Linoff · Accepted Answer · 2020-03-26 14:24:40Z

2

How about adding another group?

SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` CROSS JOIN
     UNNEST([CASE WHEN title LIKE '%game%' THEN 'games_group' END, 
             CASE WHEN title LIKE '%sport%' THEN 'sports_group' END,
             CASE WHEN title NOT LIKE '%game%' AND title NOT LIKE '%sport%' THEN 'Neither' END
            ]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
      wiki = 'en' AND
      title_group IS NOT NULL
GROUP BY title_group;

Note: This doesn't take NULL titles into account. I don't know if that is important.

However, I would phrase this using two columns:

SELECT (title LIKE '%game%') as is_game,
       (title LIKE '%sport%') as is_sport,
       SUM(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
      wiki = 'en' AND
      title_group IS NOT NULL
GROUP BY is_game, is_sport;

This doesn't return the same rows as yours -- games and sports are split across two rows. But you can see the combinations.

EDIT:

Now that I think of this, you just want a LEFT JOIN:

SELECT g.title_group, SUM(pv.views) as views, 
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` pv LEFT JOIN
     (SELECT '%game%' as pattern, 'games_group' as title_group UNION ALL
      SELECT '%sport%', 'sports_group' as title_group UNION ALL
     ) g
     ON pv.title LIKE g.pattern
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
      wiki = 'en' AND
GROUP BY g.title_group;

edited Mar 26, 2020 at 14:24

answered Mar 26, 2020 at 11:01

Gordon Linoff

1.3m62 gold badges706 silver badges857 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Rebecca Over a year ago

Yes, adding another group could be a solution ! For the second query, I can't use it as I really need only one column.

Gordon Linoff Over a year ago

@Rebecca . . . I think the edited solution is what you really want.

Mikhail Berlyant · Accepted Answer · 2020-03-26 14:00:59Z

0

Below is for BigQuery Standard SQL

#standardSQL
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`,
UNNEST(
    CASE WHEN REGEXP_CONTAINS(title, r'game|sport') THEN 
      [
        CASE WHEN (title LIKE '%game%') THEN 'games_group' END,
        CASE WHEN (title LIKE '%sport%') THEN 'sports_group' END
      ]
      ELSE ['other']
    END
) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10'AND wiki='en'
AND   title_group IS NOT NULL
GROUP BY title_group

answered Mar 26, 2020 at 14:00

Mikhail Berlyant

174k10 gold badges173 silver badges251 bronze badges

Collectives™ on Stack Overflow

BigQuery - Remove duplicates from array

2 Answers 2

2 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related