I'm working with BigQuery's Hacker News dataset and looking at which URLs have the most news stories. I'd also like to pull the domain names out of the URLs and see which domains have the most stories. I'm working in R and am having a bit of trouble getting the following query to work.
# Select the ten domains that have the most stories
sql_domain <- "SELECT url REPLACE(CASE WHEN REGEXP_CONTAINS(url, '//')
THEN url ELSE CONCAT('http://', url) END, '&', '?') as domain_name,
COUNT(domain_name) as story_number
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
GROUP BY domain_name
ORDER BY story_number DESC
LIMIT 10"
I don't need to strip the top-level domain; for example, stackoverflow isn't required, stackoverflow.com is fine. Your help is greatly appreciated!
sql_domain_ag <- "SELECT NET.REG_DOMAIN(url) as domain_name, COUNT(domain_name) as story_number
And am now getting "Error: Unrecognized name: domain_name at [2:29] [invalidQuery]" so I must be calling the function improperly, or something.
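A possible reading of the error, offered as a sketch rather than a confirmed fix: in BigQuery standard SQL, an alias defined in the SELECT list can be used in GROUP BY and ORDER BY, but not by another expression in the same SELECT list, so COUNT(domain_name) cannot see the alias. Counting rows with COUNT(*) avoids the reference altogether. The variable name sql_domain_fixed and the NULL filter are my own additions, and the commented-out bigrquery calls assume that package and a placeholder project ID.

```r
# Sketch: ten registered domains with the most stories.
# COUNT(*) replaces COUNT(domain_name), because the alias domain_name is not
# visible to other expressions in the same SELECT list.
# The NULL filter is an added assumption: it drops rows where no registered
# domain can be extracted from url.
sql_domain_fixed <- "SELECT NET.REG_DOMAIN(url) AS domain_name,
  COUNT(*) AS story_number
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
  AND NET.REG_DOMAIN(url) IS NOT NULL
GROUP BY domain_name
ORDER BY story_number DESC
LIMIT 10"

# Running it, assuming the bigrquery package and a hypothetical project ID:
# library(bigrquery)
# tb <- bq_project_query("my-project-id", sql_domain_fixed)
# top_domains <- bq_table_download(tb)
```

NET.REG_DOMAIN returns the registered domain including its public suffix (for example stackoverflow.com), which matches what I'm after.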