Extract full domain from url in Google BigQuery using regex

Question

Could you help me build a regular expression, to be used in Google BigQuery with REGEXP_EXTRACT, that extracts the full domain from a given input URL?

Parsing conditions:

Capturing should start:
- If there is a // in the URL: right after the first // occurrence
- If there is no //: at the beginning of the string
Capturing should end just before the first ?, the first / or the first &, or at the end of the string if no ?, / or & is found.

Some examples:

htp://www.google.com --> www.google.com
htp://www.google.com/item/ --> www.google.com
htp://www.google.com?source=google --> www.google.com
htp://www.google.com&source=google --> www.google.com
www.google.com --> www.google.com
www.google.com/item/ --> www.google.com
www.google.com?source=google --> www.google.com
www.google.com&source=google --> www.google.com
http://google.com&source=google --> google.com
https://www.example-code.com/vb/string.asp --> www.example-code.com

I created this REGEX:

REGEXP_EXTRACT('google.it?medium=cpc?cobranded=google&keyword=foo', r'//([^/|^?|^&]+)')

But it only works for URLs that contain //. I can't come up with a regex that also works when there is no // in the URL.

tenideas · Accepted Answer · 2017-12-18 14:45:19Z

6

BigQuery provides the following three functions:

HOST() -- Given a URL, returns the hostname as a string.

DOMAIN() -- Given a URL, returns the domain as a string.

TLD() -- Given a URL, returns the top level domain plus any country domain in the URL.

answered Dec 18, 2017 at 14:45

2 Comments

RDRR Over a year ago

BigQuery now uses NET.HOST() and NET.REG_DOMAIN() instead.

Naveen Kumar Over a year ago

@RDRR This should be tagged as the main answer now.

Lewis Hemens · Accepted Answer · 2019-07-10 16:48:30Z

6

For anyone looking for a solution using Standard SQL, the HOST() function is now under the NET namespace as NET.HOST(url): https://cloud.google.com/bigquery/docs/reference/standard-sql/net_functions#nethost

WITH
  examples AS (
  SELECT "https://some.domain.com/path?query=param#hash" AS example
  UNION ALL
  SELECT "some.domain.com/path?query=param#hash" AS example)
SELECT
  NET.HOST(example)
FROM
  examples

Returns:

some.domain.com
some.domain.com

answered Jul 10, 2019 at 16:48

Mikhail Berlyant · Accepted Answer · 2016-11-23 22:25:51Z

1

Just to justify this question having the BigQuery tag (and not just regex), consider the option below.

BigQuery Legacy SQL supports a set of URL functions.
Below is an example of their use in your case:

SELECT 
  url, 
  HOST(REPLACE(CASE WHEN url CONTAINS '//' THEN url ELSE 'http://' + url END, '&', '?')) AS output
FROM
  (SELECT 'http://www.google.com' AS url),
  (SELECT 'htp://www.google.com/item/' AS url),
  (SELECT 'htp://www.google.com?source=google' AS url),
  (SELECT 'htp://www.google.com&source=google' AS url),
  (SELECT 'www.google.com' AS url),
  (SELECT 'www.google.com/item/' AS url),
  (SELECT 'www.google.com?source=google' AS url),
  (SELECT 'www.google.com&source=google' AS url),
  (SELECT 'http://google.com&source=google' AS url)
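
For readers who want to try the normalization idea outside BigQuery, here is a minimal Python sketch of the same two steps: prepend a scheme when there is no //, and turn & into ? so the parser can see where the host ends. Note that urllib.parse.urlsplit is only a stand-in for Legacy SQL's HOST(), and the function name host_after_normalizing is made up for this illustration; the two parsers may differ on unusual inputs.

```python
from urllib.parse import urlsplit


def host_after_normalizing(url):
    """Mimic the query's CASE/REPLACE steps, then parse out the host.

    urlsplit is an assumption standing in for BigQuery's HOST().
    """
    # Step 1: without "//" the parser would treat the host as a path,
    # so add a scheme in front.
    if '//' not in url:
        url = 'http://' + url
    # Step 2: a bare "&" is not a host delimiter, so rewrite it as "?".
    url = url.replace('&', '?')
    return urlsplit(url).hostname


if __name__ == '__main__':
    for sample in ['htp://www.google.com/item/',
                   'www.google.com&source=google',
                   'http://google.com&source=google']:
        print(sample, '-->', host_after_normalizing(sample))
```

The point of the sketch is that the URL parser does the real work; the two string fixes only bring the asker's non-standard inputs into a shape the parser understands.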

answered Nov 23, 2016 at 22:25

4 Comments

Jonk Over a year ago

I would have preferred to learn how to create a regex to do this, but this is a nice way to solve the same issue, thanks, I will use it if no regex will be found!

Mikhail Berlyant Over a year ago

Understood. A quick piece of advice: learning and asking an open question are two different things. If you want to learn, you should first try something, then present the specific issue and ask how to fix or address it. That way you have a chance to learn. Instead, you are kind of outsourcing your learning to someone else, so there is not much chance of progress. I just thought this comment would help you change the way you use SO.

Mikhail Berlyant Over a year ago

These links might help you more: How to Ask, and what a Minimal, Complete, and Verifiable example is.

Jonk Over a year ago

Hi Mikhail, you are right. I'm new to the forum and I should have put my non-working solution in the body of the message (I just did). In the subject I was asking for a solution using regex and I thought that was enough! Anyway, I'm happy to have your solution to the issue; if I can't find a regex I will use it, thanks!

baddger964 · Accepted Answer · 2016-11-23 21:46:01Z

1

'//([^/|^?|^&]+)'

Your regex starts with '//', so a match has to start with '//'.

You can do this instead:

'(?://)([^/|^?|^&]+)'

With '()' I create a group, but with ?: the group is non-capturing, so it will not appear in the result.
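
Building on the non-capturing group above, one possible way to cover URLs without // is to make the whole prefix optional. The sketch below is not part of the original answer: it uses Python's re module as a stand-in for BigQuery's RE2 engine, and the names DOMAIN_RE and extract_domain are made up for illustration. The pattern uses only a non-capturing group, character classes and a single capture group, so the same pattern string should be usable with REGEXP_EXTRACT, but that is untested here.

```python
import re

# Optional, non-capturing prefix: a run of characters other than "/", "?"
# and "&", followed by "//" (for example "http://" or "htp://").
# The capture group then takes everything up to the first "/", "?" or "&".
DOMAIN_RE = re.compile(r'^(?:[^/?&]*//)?([^/?&]+)')


def extract_domain(url):
    """Return the captured domain, or None when nothing matches."""
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else None


if __name__ == '__main__':
    samples = [
        'htp://www.google.com',
        'htp://www.google.com/item/',
        'www.google.com?source=google',
        'www.google.com&source=google',
        'http://google.com&source=google',
        'https://www.example-code.com/vb/string.asp',
    ]
    for sample in samples:
        print(sample, '-->', extract_domain(sample))
```

One design choice to be aware of: because the prefix may not contain /, ? or &, a // that first appears later in the path or query string is not treated as the start of the domain. That deviates slightly from the literal "after the first //" rule in the question, but it avoids picking up a URL embedded in a query parameter.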

answered Nov 23, 2016 at 21:46

1 Comment

Jonk Over a year ago

Thanks for the feedback and explanation. Done this way, it does work on URLs that have no "//", like www.google.com, but it no longer works on URLs like google.com, because it captures "http:".

Anton Balaniuc · Accepted Answer · 2016-11-23 21:55:40Z

0

It might be something similar to

(w{0,3}\.*[a-z]+\.[a-z]*)

Explanation

It should match any URL, with or without www.
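
A quick way to see where this pattern holds up is to run it against a few inputs. The sketch below uses Python's re module as a stand-in for BigQuery's regex engine (an assumption; the helper name first_match is made up). It checks both the pattern above and the revised pattern suggested in the comments.

```python
import re

# Pattern from the answer, and the revised one from the comment thread.
ORIGINAL = re.compile(r'(w{0,3}\.*[a-z]+\.[a-z]*)')
REVISED = re.compile(r'(w{0,3}\.{0,1}[a-z-]+\.[a-z-]*)')


def first_match(pattern, url):
    """Return the first captured match anywhere in url, or None."""
    match = pattern.search(url)
    return match.group(1) if match else None


if __name__ == '__main__':
    for name, pattern in [('original', ORIGINAL), ('revised', REVISED)]:
        for sample in ['http://www.google.com/item/',
                       'www.example-code.com',
                       'www.google.co.uk']:
            print(name, sample, '-->', first_match(pattern, sample))
```

In this check the original pattern handles http://www.google.com/item/ but cuts www.example-code.com down to www.example, because [a-z] does not allow a hyphen. The revised pattern fixes the hyphen case, yet both stop after the second dot-separated label that follows the optional www prefix, so www.google.co.uk comes back as www.google.co.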

edited Nov 23, 2016 at 21:55

answered Nov 23, 2016 at 21:34

4 Comments

Jonk Over a year ago

Thanks for your help! I'm afraid I was not clear enough in my example (I just edited the message), because it should also work on any other domain, such as one not starting with www. For example, for "google.com&source=google" it should return "google.com".

Anton Balaniuc Over a year ago

or even w{0,3}\.{0,1}

Jonk Over a year ago

Hi, thanks for the feedback! It seems to me that I could remove the first part, couldn't I? I mean, just using: ([a-z]+\.[a-z]*) The only issue I see is that it would not work with domains that contain "-" (it's an allowed character), like example-code.com/vb/string.asp. Should I modify it like this: ([a-z|-]+\.[a-z|-]*)? Thanks!

Anton Balaniuc Over a year ago

@Jonk you can just use (w{0,3}\.{0,1}[a-z-]+\.[a-z-]*)

shrug · Accepted Answer · 2016-11-24 23:49:07Z

0

Would this work?

\b[\w.-]+(?:com|edu)

This only works for '.com' and '.edu' addresses, but perhaps it could be modified further.

Update

Couldn't help playing with it. Here's one that will group the domain into a capturing group:

([\w.-]++(?!:)).*+

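It requires support for lookaheads and for the possessive quantifiers ++ and *+ (the RE2 engine behind BigQuery's regex functions supports neither), and it assumes there is a line break between each URL.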

Basically it finds any series of letters, numbers, periods, or dashes not followed by a colon.

The colon is to prevent it from finding http:

The '.*+' is to consume the rest of the line so it doesn't continue to find matches after the first grouping.

edited Nov 24, 2016 at 23:49

answered Nov 24, 2016 at 1:57
