3

I need help building a regular expression, to be used with REGEXP_EXTRACT in Google BigQuery, that extracts the full domain from a given input URL.

Parsing conditions:

  • Start capturing should be:
    • If there is a // in the url: after the first // occurrence
    • If there is not a //: from the beginning of the string
  • End capturing should be: before the first ?, the first / or the first &, or at the end of the string if no ?, / or & is found

Some examples:

htp://www.google.com --> www.google.com
htp://www.google.com/item/ --> www.google.com
htp://www.google.com?source=google --> www.google.com
htp://www.google.com&source=google --> www.google.com
www.google.com --> www.google.com
www.google.com/item/ --> www.google.com
www.google.com?source=google --> www.google.com
www.google.com&source=google --> www.google.com
http://google.com&source=google --> google.com
https://www.example-code.com/vb/string.asp --> www.example-code.com

I created this REGEX:

REGEXP_EXTRACT('google.it?medium=cpc?cobranded=google&keyword=foo', r'//([^/|^?|^&]+)')

But it only works for URLs that contain //. I can't come up with a regex that also works when there is no // in the URL.

6 Answers

6

BigQuery (legacy SQL) provides the following three functions:

HOST() -- Given a URL, returns the hostname as a string.

DOMAIN() -- Given a URL, returns the domain as a string.

TLD() -- Given a URL, returns the top level domain plus any country domain in the URL.

2 Comments

BigQuery now uses NET.HOST() and NET.REG_DOMAIN() instead.
@RDRR This should be tagged as the main answer now.
6

For anyone looking for a solution using Standard SQL, the HOST() function is now under the NET namespace as NET.HOST(url): https://cloud.google.com/bigquery/docs/reference/standard-sql/net_functions#nethost

WITH
  examples AS (
  SELECT "https://some.domain.com/path?query=param#hash" AS example
  UNION ALL
  SELECT "some.domain.com/path?query=param#hash" AS example)
SELECT
  NET.HOST(example)
FROM
  examples

Returns:

some.domain.com
some.domain.com

1

To justify this question carrying the BigQuery tag (and not just regex), consider the option below.

BigQuery Legacy SQL supports a set of URL functions.
Below is an example of their use in your case:

SELECT 
  url, 
  HOST(REPLACE(CASE WHEN url CONTAINS '//' THEN url ELSE 'http://' + url END, '&', '?')) AS output
FROM
  (SELECT 'http://www.google.com' AS url),
  (SELECT 'htp://www.google.com/item/' AS url),
  (SELECT 'htp://www.google.com?source=google' AS url),
  (SELECT 'htp://www.google.com&source=google' AS url),
  (SELECT 'www.google.com' AS url),
  (SELECT 'www.google.com/item/' AS url),
  (SELECT 'www.google.com?source=google' AS url),
  (SELECT 'www.google.com&source=google' AS url),
  (SELECT 'http://google.com&source=google' AS url)
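
The normalization steps in the query above (add a scheme when there is no //, turn & into ?, then take the host) can be sketched outside BigQuery as well. This is a rough Python analogue only: urlsplit stands in for HOST() and is not the BigQuery implementation, and the helper name host_like_legacy_query is hypothetical.

```python
from urllib.parse import urlsplit


def host_like_legacy_query(url):
    """Approximate the CASE / REPLACE / HOST() chain from the query above."""
    # Mirror the CASE expression: add a scheme when the URL has no "//".
    if "//" not in url:
        url = "http://" + url
    # Mirror REPLACE(url, '&', '?') so that a stray "&" ends the host part.
    url = url.replace("&", "?")
    # urlsplit is an assumption standing in for HOST(); results may differ
    # from BigQuery on unusual inputs.
    return urlsplit(url).hostname


if __name__ == "__main__":
    for example in ["htp://www.google.com&source=google", "www.google.com/item/"]:
        print(example, "-->", host_like_legacy_query(example))
```

The replacement of & by ? matters because a URL parser would otherwise treat &source=google as part of the host name.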

4 Comments

I would have preferred to learn how to create a regex to do this, but this is a nice way to solve the same issue. Thanks, I will use it if no regex is found!
Understood. A quick piece of advice: learning and asking an open question are two different things. If you want to learn, you should first try something, then present the specific issue and ask how to fix or address it. That way you have a chance to learn. Instead, you are kind of outsourcing your learning to someone else, so there is not much chance of progress. I just thought this comment might help you change the way you use SO.
These links might help you more: "How to Ask" and "what is a Minimal, Complete, and Verifiable example".
Hi Mikhail, you are right. I'm new to the forum and I should have put my non-working solution in the body of the message (I just did). In the subject I asked for a solution using regex and I thought that was enough! Anyway, I'm happy to have your solution to the issue. If I can't find a regex I will use it, thanks!
1
'//([^/|^?|^&]+)'

Because your regex starts with '//', a match can only be found in a string that contains '//'.

You can do this instead:

'(?://)([^/|^?|^&]+)'

The parentheses create a group, and the ?: prefix makes it non-capturing, so the group does not appear in the result.
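
A non-capturing group still has to match, so another option is to make the '//' prefix optional. The sketch below tries that idea using Python's re module as a stand-in for BigQuery's RE2 engine. The pattern ^(?:.*?//)?([^/?&]+) and the helper name extract_domain are assumptions rather than something taken from this thread, and the pattern has not been run in BigQuery itself. It also writes the character class as [^/?&], because inside [...] the | and the second and third ^ are literal characters, not operators.

```python
import re

# Hypothetical pattern (not from the thread):
#   ^(?:.*?//)?   optional prefix, lazily matched up to the first "//"
#   ([^/?&]+)     capture everything up to the first "/", "?" or "&"
DOMAIN_RE = re.compile(r'^(?:.*?//)?([^/?&]+)')


def extract_domain(url):
    """Return the captured domain, or None when nothing matches."""
    match = DOMAIN_RE.search(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    examples = [
        "htp://www.google.com/item/",
        "www.google.com&source=google",
        "https://www.example-code.com/vb/string.asp",
        "google.it?medium=cpc?cobranded=google&keyword=foo",
    ]
    for example in examples:
        print(example, "-->", extract_domain(example))
```

If the pattern behaves the same way under RE2, it could be passed to REGEXP_EXTRACT as a raw string with its single capture group.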

1 Comment

Thanks for feedback and explanation, but by doing like this it does work on url not having "//" like www.google.com but it does not work anymore with url like google.com because it catches "http:"
0

It might be something similar to

(w{0,3}\.*[a-z]+\.[a-z]*)

Explanation

This should match any URL, with or without www.

4 Comments

Thanks for your help! I'm afraid I was not clear enough in my example (I just edited the message), because it should also work on any other domain, such as one not starting with www. For example, for "google.com&source=google" it should return "google.com".
or even w{0,3}\.{0,1}
Hi, thanks for the feedback! It seems to me that I could remove the first part, couldn't I? I mean, just using: ([a-z]+\.[a-z]*). The only issue I see is that it would not work with domains that contain "-" (it's an allowed character), like example-code.com/vb/string.asp. Should I modify it like this: ([a-z|-]+\.[a-z|-]*)? Thanks!
@Jonk you can just use (w{0,3}\.{0,1}[a-z-]+\.[a-z-]*)
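
The two patterns from this answer and its comments can be compared side by side. The sketch below is only an illustration in Python's re module, used here as a stand-in for BigQuery's regex engine; the helper name first_match is hypothetical.

```python
import re

# Patterns quoted from the answer and from the last comment above.
ORIGINAL = re.compile(r'(w{0,3}\.*[a-z]+\.[a-z]*)')
WITH_HYPHEN = re.compile(r'(w{0,3}\.{0,1}[a-z-]+\.[a-z-]*)')


def first_match(pattern, url):
    """Return the first capture of `pattern` found in `url`, or None."""
    match = pattern.search(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    url = "https://www.example-code.com/vb/string.asp"
    # The original pattern stops at the hyphen; the second one does not.
    print(first_match(ORIGINAL, url))
    print(first_match(WITH_HYPHEN, url))
    # Both patterns require exactly one dot after the optional "www."
    # part, so a deeper host name is cut short.
    print(first_match(WITH_HYPHEN, "mail.google.com"))
```

In this sketch the hyphen-aware pattern handles www.example-code.com, but a host such as mail.google.com is truncated to mail.google, so these patterns do not cover every domain shape.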
0

Would this work?

\b[\w.-]+(?:com|edu)

This only works for '.com' and '.edu' addresses, but perhaps it could be modified further.

Update:

Couldn't help playing with it. Here's one that will group the domain into a capturing group:

([\w.-]++(?!:)).*+

It requires support for lookaheads and possessive quantifiers, and it assumes each URL is on its own line.

Basically it finds any series of letters, numbers, periods, or dashes not followed by a colon.

The colon is to prevent it from finding http:

The '.*+' is to consume the rest of the line so it doesn't continue to find matches after the first grouping.
