3

I have a table with a URL column.

I would like to break the URL into domain and path. I can get the domain by using DOMAIN(URL) in BigQuery syntax.

My question is: how do I get the path of the URL?

e.g. http://www.somedomain.com/X/Y/abc

I want to get X, Y and abc as separate columns.

3 Comments
  • good feature request, looking into it Commented Feb 25, 2014 at 1:11
  • Thanks. I am doing a like-for-like comparison with Microsoft Log Parser. Commented Feb 25, 2014 at 1:28
  • Unfair comparison - BigQuery is better (or not? I'd love to see your final tally). It could certainly benefit from having this feature - thanks for the request! Commented Feb 25, 2014 at 1:49

2 Answers

6

You can use a regular expression to extract what you need:

SELECT
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(.*)') AS full_path,
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(?:[^/]*/){0}([^/]*)') AS full_path0,
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(?:[^/]*/){1}([^/]*)') AS full_path1,
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(?:[^/]*/){2}([^/]*)') AS full_path2,
  REGEXP_EXTRACT(URL, r'^http://www(?:[^/]*)/(?:[^/]*/){3}([^/]*)') AS full_path3
FROM
  (SELECT 'http://www.somedomain.com/X/Y/abc' AS URL)
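To see what each pattern captures before running the query, the same regular expressions can be tried locally. The following is only a sketch in Python: it uses Python's `re` module rather than BigQuery's RE2 engine (the constructs used here are common to both), and the helper name `path_segment` is made up for this illustration.

```python
import re

URL = "http://www.somedomain.com/X/Y/abc"


def path_segment(url, level):
    """Return the path segment at 0-based position `level`, or None if absent.

    Uses the same pattern as the query: skip `level` segments that end
    in '/', then capture the next run of non-slash characters.
    """
    pattern = r"^http://www(?:[^/]*)/(?:[^/]*/){%d}([^/]*)" % level
    match = re.search(pattern, url)
    return match.group(1) if match else None


segments = [path_segment(URL, i) for i in range(4)]
print(segments)  # ['X', 'Y', 'abc', None]
```

The fourth value is None because the example URL has only three path segments; in BigQuery the corresponding column would be NULL. Note that the pattern is anchored on `http://www`, so URLs with another scheme or host prefix (for example `https://`) will not match.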

Regarding the comparison with MS Log Parser:

  • Log Parser runs directly on the flat log files, while in BQ you need to load the data first.
  • Log Parser runs on a dedicated machine, while BQ runs as a cloud service (many machines; you don't need to care how many).
  • Performance-wise, you'll find that BQ is faster, and you don't need to worry about the resources available for processing. (Log Parser can run only as many threads as there are available CPU units, and it consumes a lot of cache on the machine it runs on.)
  • The regex functions in BQ give you all the flexibility you need to extract any pattern of data from the logs.

Enjoy


3 Comments

Is it possible to use variables in BigQuery?
Unfortunately no... (I think there are some feature requests for it.) Our workaround is an offline process (Python) that generates a query from a template and replaces some placeholder strings in it.
Awesome. For the domain part, one can add: Regexp_extract(pageurl,r'^(?:http:\/\/|www\.|https:\/\/)([^\/]+)')
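
The template workaround mentioned in the comments could look roughly like the sketch below. It is not the commenter's actual script: the template text, the placeholder names (`$path_columns`, `$table`), the function names, and the table name `my_dataset.my_table` are all assumptions made for illustration. It generates one REGEXP_EXTRACT column per path level, reusing the pattern from the answer.

```python
from string import Template

# Hypothetical query template; the placeholder names are made up for this sketch.
QUERY_TEMPLATE = Template("SELECT\n$path_columns\nFROM $table")


def path_level_column(url_column, level):
    """Build one REGEXP_EXTRACT expression for the 0-based path `level`."""
    pattern = r"^http://www(?:[^/]*)/(?:[^/]*/){%d}([^/]*)" % level
    return f"  REGEXP_EXTRACT({url_column}, r'{pattern}') AS full_path{level}"


def build_query(table, url_column, levels):
    """Fill the template with `levels` path columns for the given table."""
    columns = ",\n".join(path_level_column(url_column, i) for i in range(levels))
    return QUERY_TEMPLATE.substitute(path_columns=columns, table=table)


query = build_query("my_dataset.my_table", "URL", 3)
print(query)
```

The generated text can then be submitted to BigQuery like any hand-written query. Only pass trusted table and column names to such a generator, since the values are pasted into the query text unchanged.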
-1

ga_sessions has hits leaf tables that break up your URL automatically.

With your example of

http://www.somedomain.com/X/Y/abc

hits.page.pagePathLevel1 will have 'www.somedomain.com/'
hits.page.pagePathLevel2 will have '/X/'
hits.page.pagePathLevel3 will have '/Y/'

1 Comment

The question is generic to BigQuery, and not to Analytics data in BigQuery.
