Using BigQuery, I want to group pages by their title in a single query and calculate different metrics for each group. Because the rules on titles are not mutually exclusive, I did it this way:
SELECT SUM(views) AS views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
CROSS JOIN
  UNNEST([
    CASE WHEN (title LIKE '%game%')
      THEN 'games_group' END,
    CASE WHEN (title LIKE '%sport%')
      THEN 'sports_group' END
  ]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10'
  AND wiki = 'en'
GROUP BY title_group
Here is the result:
views        title_group
3414469869   null
4355264      games_group
1361074      sports_group
However, the number 3414469869 for the views of pages that don't belong to any group is wrong. When a title contains "sport" but not "game" (or the reverse), we get UNNEST([null, "sports_group"]) (or UNNEST(["games_group", null])), so its views are still counted in the null group. When a title contains neither "game" nor "sport", we get UNNEST([null, null]) and its views are even counted twice.
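To make the overcounting concrete, here is a minimal sketch of the same query on toy data. The `pages` rows (titles and view counts) are made up for illustration and stand in for the real Wikipedia table; only the UNNEST pattern is taken from my query.

```sql
-- Toy data: hypothetical titles and view counts, not the real table.
WITH pages AS (
  SELECT 'video game' AS title, 10 AS views UNION ALL
  SELECT 'sport news', 20 UNION ALL
  SELECT 'cooking', 100
)
SELECT SUM(views) AS views, title_group
FROM pages
CROSS JOIN
  UNNEST([
    -- Each CASE yields NULL when the title does not match,
    -- so every row always produces exactly two array elements.
    CASE WHEN title LIKE '%game%' THEN 'games_group' END,
    CASE WHEN title LIKE '%sport%' THEN 'sports_group' END
  ]) AS title_group
GROUP BY title_group
-- Result (row order not guaranteed):
--   230   null          -- 10 + 20 + 100 + 100
--   10    games_group
--   20    sports_group
```

Only 'cooking' (100 views) really belongs to no group, yet the null group reports 230: it picks up 10 and 20 from the pages that matched one rule, plus 100 twice from the page that matched none.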
Is there a way to remove duplicates from the array?