
I need help building a regular expression for Google BigQuery's REGEXP_EXTRACT that extracts the value of a URL parameter identified by a specific key.

Let's suppose that the parameter I would like to parse has the key "source". The parsing should:

  • Ensure that the key is preceded by "?" or "&" and followed by "=", so in this example it should match "?source=" or "&source="
  • Capture the value up to the first "&" or the end of the string
  • If the above conditions match more than once, take the value of the first occurrence

Here are some examples of the desired behaviour, each with its expected output:

  • www.google.com?source=google&medium=cpc --> output: google
  • www.google.com?source=google --> output: google
  • www.google.com?medium=cpc&source=google --> output: google
  • www.google.com?medium=cpc&source=google&keyword=foo --> output: google
  • www.google.com?medium=cpc&source=google&keyword=foo&source=bing --> output: google
  • www.google.it?medium=cpc?source=goo-gle --> output: goo-gle
  • www.google.it?medium=cpc?source=google?med=cpc&keyword=foo --> output: google?med=cpc

Thanks very much for any help!

3 Answers

10

[?&]source=([^&]+)

The first captured group in the match will be the value of the "source" parameter.

  • [?&] Either ? or &
  • source= Literal text
  • ([^&]+) A captured group containing 1 or more characters that are not &
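As a quick sanity check, here is a minimal sketch that runs the same pattern over the seven example URLs from the question. It uses Python's `re` module as a stand-in for BigQuery's RE2 engine, on the assumption that the two behave the same for a pattern this simple; `extract_source` is a hypothetical helper name, not a BigQuery function.

```python
import re

# Same pattern as above. Assumption: Python's re matches it the same way RE2 does.
SOURCE_PATTERN = re.compile(r"[?&]source=([^&]+)")


def extract_source(url):
    """Return the first 'source' value in url, or None when nothing matches
    (REGEXP_EXTRACT returns NULL in that case)."""
    match = SOURCE_PATTERN.search(url)
    return match.group(1) if match else None


# The example URLs and expected outputs from the question.
EXAMPLES = {
    "www.google.com?source=google&medium=cpc": "google",
    "www.google.com?source=google": "google",
    "www.google.com?medium=cpc&source=google": "google",
    "www.google.com?medium=cpc&source=google&keyword=foo": "google",
    "www.google.com?medium=cpc&source=google&keyword=foo&source=bing": "google",
    "www.google.it?medium=cpc?source=goo-gle": "goo-gle",
    "www.google.it?medium=cpc?source=google?med=cpc&keyword=foo": "google?med=cpc",
}

for url, expected in EXAMPLES.items():
    assert extract_source(url) == expected, url
```

Because the search is leftmost-first, the first occurrence wins when "source" appears more than once. Note that `+` requires at least one character, so an empty value such as "?source=&medium=cpc" produces no match at that position.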

3 Comments

Thanks a lot, this works on all examples! For my understanding, what would be the difference compared with just using REGEXP_EXTRACT(url, r'[?&]source=([^&]+)')? That also seems to work fine on all my examples, but I guess there must be some scenario where it behaves differently. Can you give me one so that I fully understand? Thanks! Marco
Sorry--not sure what the question is. In your case, the \r\n is probably optional. Does that help?
Those were added while I was testing and I forgot to take them out ;P
6

If you need to extract all parameters from a URL, you can also use REGEXP_EXTRACT_ALL as follows:

REGEXP_EXTRACT_ALL(query,r'(?:\?|&)((?:[^=]+)=(?:[^&]*))') as params

(Posting here because this question ranks highly on Google for "bigquery parse url query string", but the chosen answer only works for one parameter that is already defined).

This will return the result as an array (see "How to extract URL parameters as ARRAY in Google BigQuery").
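To see what that array looks like, here is a minimal sketch of the same pattern using Python's `re.findall`, which, like REGEXP_EXTRACT_ALL, returns the contents of the single capturing group for every match. Python's `re` is assumed to behave like RE2 for this pattern, and `extract_params` is a hypothetical helper name.

```python
import re

# Same pattern as in the REGEXP_EXTRACT_ALL call above.
# Assumption: Python's re matches it the same way BigQuery's RE2 does.
PARAM_PATTERN = re.compile(r"(?:\?|&)((?:[^=]+)=(?:[^&]*))")


def extract_params(url):
    """Return every 'key=value' pair found after a '?' or '&'."""
    return PARAM_PATTERN.findall(url)


params = extract_params("www.google.com?medium=cpc&source=google&keyword=foo")
# params is ['medium=cpc', 'source=google', 'keyword=foo']

# Optional: split each pair on the first '=' to get a key/value mapping.
# With duplicate keys, the last value wins in the dict.
pairs = dict(p.split("=", 1) for p in params)
```

One caveat: `[^=]+` also accepts "&" and "?", so a parameter without "=" (for example "?flag&source=google") gets merged into the next pair.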

1 Comment

Maybe not directly related, but how would you do this once and for all? I imagine it's not very efficient to re-run the regexp on all logs every single time you query your data.
0

The value of source can be extracted as follows:

select regexp_extract("www.google.it?medium=cpc&source=google&keyword=foo&source=bing", "[?&]source=([^&]+)")
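The key does not have to be hard-coded. As a rough sketch, the pattern can be built from any parameter name, as long as the name is escaped so that characters like "." are matched literally. This is shown in Python with `re` standing in for BigQuery's RE2 (an assumption); `extract_param` is a hypothetical helper, not part of BigQuery.

```python
import re


def extract_param(url, key):
    """Return the first value of `key` in url, or None if it is absent.

    Hypothetical helper: builds the same [?&]key=([^&]+) pattern for any key.
    """
    # re.escape makes regex metacharacters in the key match literally.
    pattern = r"[?&]" + re.escape(key) + r"=([^&]+)"
    match = re.search(pattern, url)
    return match.group(1) if match else None


url = "www.google.it?medium=cpc&source=google&keyword=foo&source=bing"
first_source = extract_param(url, "source")   # first occurrence only
medium = extract_param(url, "medium")
missing = extract_param(url, "campaign")      # key not present
```

In BigQuery itself the equivalent would be to concatenate the escaped key into the pattern string before passing it to REGEXP_EXTRACT.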

1 Comment

Please add some explanation rather than posting only code.
