
I have several thousand URLs and want to extract some values from the URL parameters. Here are some examples from the DB:

["www.xxx.com?uci=6666&rci=fefw"]
["www.xxx.com?uci=61"]
["www.xxx.com?rci=62&uci=5536"]
["www.xxx.com?uci=6666&utm_source=XXX"]
["www.xxx.com?pccst=TEST%20sTESTg"]
["www.xxx.com?pccst=TEST2%20s&uci=1"]
["www.xxx.com?uci=1pccst=TEST42rt24&rci=2"]

How can I extract the value of the parameter uci? It is always a number (I don't know the exact length). I tried REGEXP_EXTRACT, but I didn't succeed:

REGEXP_EXTRACT(URL, '(uci)\=[0-9]+') AS UCI_extract

I also want to extract the value of the parameter pccst. It can contain any character and I don't know the exact length, but it always ends with " or ? or &.

I also tried REGEXP_EXTRACT for this, but didn't succeed:

REGEXP_EXTRACT(URL, r'pccst\=(.*)(\"|\&|\?)') AS pccst_extract

I am really not a regex expert, so it would be great if someone could help me. Thanks a lot in advance, Peter

2 Answers


You can adapt this solution:

#standardSQL
# Extract query parameters from a URL as ARRAY in BigQuery; standard-sql; 2018-04-08
# @see http://www.pascallandau.com/bigquery-snippets/extract-url-parameters-array/
WITH examples AS (
  SELECT 1   AS id, 'www.xxx.com?uci=6666&rci=fefw' AS query 
  UNION ALL SELECT 2, 'www.xxx.com?uci=1pccst%20TEST42rt24&rci=2'
  UNION ALL SELECT 3, 'www.xxx.com?pccst=TEST2%20s&uci=1'
)
SELECT 
  id, 
  query,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)((?:[^=]+)=(?:[^&]*))') as params,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:([^=]+)=(?:[^&]*))') as keys,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values
FROM examples
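
The query above returns every parameter as an array. To pull out a single parameter such as uci from those arrays, one possible approach is a scalar subquery over the extracted "key=value" strings. This is only a sketch: the two sample rows are taken from the question, and the `param` alias and the `uci` output column are names chosen here for illustration.

```sql
#standardSQL
WITH examples AS (
  SELECT 1 AS id, 'www.xxx.com?uci=6666&rci=fefw' AS query
  UNION ALL SELECT 2, 'www.xxx.com?pccst=TEST2%20s&uci=1'
)
SELECT
  id,
  query,
  -- Scan the extracted "key=value" strings and keep the value whose key is uci.
  (
    SELECT SPLIT(param, '=')[SAFE_OFFSET(1)]
    FROM UNNEST(REGEXP_EXTRACT_ALL(query, r'(?:\?|&)((?:[^=]+)=(?:[^&]*))')) AS param
    WHERE SPLIT(param, '=')[SAFE_OFFSET(0)] = 'uci'
    LIMIT 1
  ) AS uci
FROM examples
```

If a URL has no uci parameter, the subquery returns no row and the column is NULL.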


1 Comment

Thanks a lot. Extracting all query parameters from a URL as an ARRAY is perfect!

Below is an example for BigQuery Standard SQL:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
  SELECT "www.xxx.com?uci=61" UNION ALL
  SELECT "www.xxx.com?rci=62&uci=5536" UNION ALL
  SELECT "www.xxx.com?uci=6666&utm_source=XXX" UNION ALL
  SELECT "www.xxx.com?pccst=TEST%20sTESTg" UNION ALL
  SELECT "www.xxx.com?pccst=TEST2%20s&uci=1" UNION ALL
  SELECT "www.xxx.com?uci=1&pccst=TEST42rt24&rci=2" 
)
SELECT 
  url, 
  REGEXP_EXTRACT(url, r'[?&]uci=(.*?)(?:$|&)') uci,
  REGEXP_EXTRACT(url, r'[?&]pccst=(.*?)(?:$|&)') pccst
FROM `project.dataset.table`   

The result is:

Row url                                         uci     pccst    
1   www.xxx.com?pccst=TEST%20sTESTg             null    TEST%20sTESTg    
2   www.xxx.com?pccst=TEST2%20s&uci=1           1       TEST2%20s    
3   www.xxx.com?uci=1&pccst=TEST42rt24&rci=2    1       TEST42rt24   
4   www.xxx.com?uci=61                          61      null     
5   www.xxx.com?rci=62&uci=5536                 5536    null     
6   www.xxx.com?uci=6666&rci=fefw               6666    null     
7   www.xxx.com?uci=6666&utm_source=XXX         6666    null        
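
As for why the attempts in the question failed: when the pattern contains a capturing group, REGEXP_EXTRACT returns the text matched by that group, so `(uci)\=[0-9]+` returns the literal string uci rather than the digits. The pccst attempt contains two capturing groups, which REGEXP_EXTRACT rejects with an error. A stricter variation of the query above, assuming uci really is always numeric, could look like the sketch below. The cast to INT64 is my own addition and can be dropped if a STRING is preferred.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
  SELECT "www.xxx.com?rci=62&uci=5536" UNION ALL
  SELECT "www.xxx.com?pccst=TEST2%20s&uci=1"
)
SELECT
  url,
  -- The single capturing group wraps the digits, so the digits are what gets returned.
  -- SAFE_CAST yields NULL instead of an error if the value cannot be converted.
  SAFE_CAST(REGEXP_EXTRACT(url, r'[?&]uci=([0-9]+)') AS INT64) AS uci,
  -- [^&]* stops at the next & or at the end of the string, so no lazy quantifier is needed.
  REGEXP_EXTRACT(url, r'[?&]pccst=([^&]*)') AS pccst
FROM `project.dataset.table`
```

The leading `[?&]` keeps the pattern from matching a parameter whose name merely ends in uci or pccst.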

Alternatively, the option below parses out all key-value pairs, so you can then dynamically select the ones you need:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
  SELECT "www.xxx.com?uci=61" UNION ALL
  SELECT "www.xxx.com?rci=62&uci=5536" UNION ALL
  SELECT "www.xxx.com?uci=6666&utm_source=XXX" UNION ALL
  SELECT "www.xxx.com?pccst=TEST%20sTESTg" UNION ALL
  SELECT "www.xxx.com?pccst=TEST2%20s&uci=1" UNION ALL
  SELECT "www.xxx.com?uci=1&pccst=TEST42rt24&rci=2" 
)
SELECT url, 
  ARRAY(
    SELECT AS STRUCT 
      SPLIT(kv, '=')[SAFE_OFFSET(0)] key, 
      SPLIT(kv, '=')[SAFE_OFFSET(1)] value 
    FROM UNNEST(SPLIT(SUBSTR(url, LENGTH(NET.HOST(url)) + 2), '&')) kv
  ) key_value_pair
FROM `project.dataset.table`
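
To then select individual parameters from the key_value_pair array, one option is to wrap the query above in a CTE and look up each key with a scalar subquery. This is a minimal sketch: the CTE name `parsed` and the output column names are assumptions made for illustration, and only two sample URLs are included to keep it short.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
  SELECT "www.xxx.com?pccst=TEST2%20s&uci=1"
),
parsed AS (
  -- Same key/value parsing as in the query above.
  SELECT url,
    ARRAY(
      SELECT AS STRUCT
        SPLIT(kv, '=')[SAFE_OFFSET(0)] key,
        SPLIT(kv, '=')[SAFE_OFFSET(1)] value
      FROM UNNEST(SPLIT(SUBSTR(url, LENGTH(NET.HOST(url)) + 2), '&')) kv
    ) key_value_pair
  FROM `project.dataset.table`
)
SELECT
  url,
  -- Each subquery returns the value for one key, or NULL when the key is absent.
  (SELECT value FROM UNNEST(key_value_pair) WHERE key = 'uci' LIMIT 1) AS uci,
  (SELECT value FROM UNNEST(key_value_pair) WHERE key = 'pccst' LIMIT 1) AS pccst
FROM parsed
```

Parsing once and looking up keys afterwards is convenient when many parameters are needed, while the REGEXP_EXTRACT version is simpler when only one or two are required.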
