I'm trying to strip the scheme and domain from URLs, so that a URL like https://www.nike.com/w/nike-free-5-shoes-3apemzy7ok becomes /w/nike-free-5-shoes-3apemzy7ok

or https://www.kohls.com/search/mens.jsp becomes /search/mens.jsp

I can't use the RIGHT function, because there are many different domains, so the number of characters to remove varies from case to case.

Does anyone know how to write a SQL query that can support this effort?

What I had in mind was something that looks for ".com" and uses a wildcard to remove ".com" plus everything before it.

That said, I haven't been able to figure out how to do this after a fair amount of research.

Appreciate the help!

3 Answers

1

Below is for BigQuery Standard SQL

REGEXP_EXTRACT(url, NET.HOST(url) || '[^/]*/(.+)')

You can test and play with the above using sample data, as in the example below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'https://www.nike.com/w/nike-free-5-shoes-3apemzy7ok' AS url UNION ALL
  SELECT 'https://www.kohls.com/search/mens.jsp' UNION ALL
  SELECT 'www.Example.Co.UK/1/2/3' UNION ALL
  SELECT 'www.Example.Co.UK:80/1/2/3' UNION ALL
  SELECT 'https://www.Example.Co.UK:80/1/2/3' 
)
SELECT url, 
  REGEXP_EXTRACT(url, NET.HOST(url) || '[^/]*/(.+)') path
FROM `project.dataset.table`   

with output

Row url                                                     path     
1   https://www.nike.com/w/nike-free-5-shoes-3apemzy7ok     w/nike-free-5-shoes-3apemzy7ok   
2   https://www.kohls.com/search/mens.jsp                   search/mens.jsp  
3   www.Example.Co.UK/1/2/3                                 1/2/3    
4   www.Example.Co.UK:80/1/2/3                              1/2/3    
5   https://www.Example.Co.UK:80/1/2/3                      1/2/3    
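
The question asks for the path with its leading slash, whereas the pattern above captures only what follows that slash. One possible adjustment, sketched here on the assumption that NET.HOST behaves as in the example above, is to move the slash inside the capture group. The CTE name `sample_urls` is made up for illustration.

```sql
#standardSQL
-- Sketch: same approach as above, but the slash sits inside the capture
-- group, so the leading '/' is kept in the result.
-- sample_urls is a hypothetical name; the rows are the two URLs from the question.
WITH sample_urls AS (
  SELECT 'https://www.nike.com/w/nike-free-5-shoes-3apemzy7ok' AS url UNION ALL
  SELECT 'https://www.kohls.com/search/mens.jsp'
)
SELECT url,
  REGEXP_EXTRACT(url, NET.HOST(url) || '[^/]*(/.+)') AS path
FROM sample_urls
```

For the two sample rows this should return /w/nike-free-5-shoes-3apemzy7ok and /search/mens.jsp. A URL with nothing after the host has no match, so REGEXP_EXTRACT returns NULL for it.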

0

If none of the URLs defeat the logic of "find the first slash after the first double slash" you could:

SELECT SUBSTRING(url, CHARINDEX('/', url, CHARINDEX('//', url) + 2) + 1, 9999)

In English this is "substring starting (just after the index of the first slash in the url, searching from just after the index of the first double slash) with a length longer than the rest of the string (= take to the end of the string)".
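
To try this against the URLs from the question, a minimal sketch follows. It assumes a SQL Server-style dialect in which CHARINDEX, SUBSTRING and a VALUES table constructor are available; the alias `t` is made up for illustration. It also drops the + 1, so that the leading slash asked for in the question is kept.

```sql
-- Sketch, assuming SQL Server-style syntax (CHARINDEX, SUBSTRING, VALUES).
-- Without the "+ 1", the substring starts at the slash itself,
-- so the result keeps its leading '/'.
SELECT url,
       SUBSTRING(url,
                 CHARINDEX('/', url, CHARINDEX('//', url) + 2),
                 9999) AS path
FROM (VALUES
        ('https://www.nike.com/w/nike-free-5-shoes-3apemzy7ok'),
        ('https://www.kohls.com/search/mens.jsp')
     ) AS t(url);
```

For the first row, the double slash is found at position 7, the search for the next slash starts at position 9, and the substring begins at the slash that follows www.nike.com. A URL with no slash after the host would not be trimmed by this expression, so such rows may need separate handling.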

0

In BigQuery, you can use regexp_replace():

select regexp_replace(url, '^.*//[^/]+/(.*)$', '\\1')
from (select 'https://www.nike.com/w/nike-free-5-shoes-3apemzy7ok' as url union all
      select 'https://www.kohls.com/search/mens.jsp' 
     ) x;
