Could you help me build a regular expression, to be used on Google BigQuery with REGEXP_EXTRACT, that extracts the full domain from a given input URL?
Parsing conditions:
- Start capturing should be:
  - If there is a // in the url: after the first // occurrence
  - If there is not a //: from the beginning of the string
- End capturing should be: at the first ?, / or & (whichever comes first), or at the end of the string if no ?, / or & is found
Some examples:
http://www.google.com --> www.google.com
http://www.google.com/item/ --> www.google.com
http://www.google.com?source=google --> www.google.com
http://www.google.com&source=google --> www.google.com
www.google.com --> www.google.com
www.google.com/item/ --> www.google.com
www.google.com?source=google --> www.google.com
www.google.com&source=google --> www.google.com
http://google.com&source=google --> google.com
https://www.example-code.com/vb/string.asp --> www.example-code.com
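To make the conditions above unambiguous, here is a rough Python sketch of the behaviour I'm after. It uses plain string handling, so it is not the BigQuery regex I'm looking for, and the function name extract_domain is just something I made up for illustration:

```python
def extract_domain(url: str) -> str:
    """Hypothetical reference helper mirroring the parsing conditions (not a BigQuery API)."""
    # Start: right after the first "//" if there is one, otherwise at the beginning.
    start = url.find("//")
    rest = url[start + 2:] if start != -1 else url
    # End: just before the first "?", "/" or "&", otherwise the end of the string.
    cuts = [i for i in (rest.find(c) for c in "?/&") if i != -1]
    return rest[:min(cuts)] if cuts else rest


# A few of the examples listed above, used as checks.
examples = {
    "www.google.com": "www.google.com",
    "www.google.com/item/": "www.google.com",
    "www.google.com?source=google": "www.google.com",
    "http://google.com&source=google": "google.com",
    "https://www.example-code.com/vb/string.asp": "www.example-code.com",
}
for url, expected in examples.items():
    assert extract_domain(url) == expected, url
```

The regex should return the same results as this helper for every input.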
I created this REGEX:
REGEXP_EXTRACT('google.it?medium=cpc?cobranded=google&keyword=foo', r'//([^/|^?|^&]+)')
But it only works for URLs that contain //. I can't come up with a regex that also works when there is no // in the URL.