0

I know there are a lot of questions about lazy regex matching, but none of the solutions I've seen have worked. Here's the issue:

Some addresses in my BigQuery results look like this:

www.example.comwww.example.com/path/to/page
apply.example.comapply.example.com/eapp/

I want to strip off the duplicated part to get

www.example.com/path/to/page
apply.example.com/eapp/

I've tried using REGEX_REPLACE() like this:

REGEXP_REPLACE(raw.query.tidy_Landing, r'(.*?)\.com','') AS Landing_Page

But that still finds both matches and so returns

/path/to/page
/eapp/

What's wrong with my regex?

2 Answers 2

2

#standardSQL
WITH t AS (
  SELECT 'www.example.comwww.example.com/path/to/page' str UNION ALL
  SELECT 'apply.example.comapply.example.com/eapp/'
)
SELECT str, REGEXP_REPLACE(str, r'^(.*?\.com)', '') fix
FROM t

output is

str                                             fix  
www.example.comwww.example.com/path/to/page     www.example.com/path/to/page  
apply.example.comapply.example.com/eapp/        apply.example.com/eapp/   
Sign up to request clarification or add additional context in comments.

1 Comment

Ah, I see. The ^ ensures that it's looking form the beginning of the string and stops after a first match. Otherwise it will keep returning matches even though it's lazy and they're distinct.
1

I wanted to see if it's possible to do this without a regex, although it ends up being a bit verbose :) This answer assumes that the address is always duplicated and ends with .com. Assuming that's the case, it should be possible to use SPLIT to extract the parts that you are interested in:

SELECT
  CONCAT(
    SPLIT(text, '.com')[OFFSET(0)],
    '.com',
    SPLIT(text, '.com')[OFFSET(2)]
  ) AS Landing_Page
FROM (
  SELECT 'www.example.comwww.example.com/path/to/page' AS text UNION ALL
  SELECT 'apply.example.comapply.example.com/eapp/'
);

If you want the query to be tolerant of non-duplicated addresses, you can make a slight modification:

SELECT
  (
    SELECT 
      CONCAT(
        parts[OFFSET(0)],
        '.com',
        parts[OFFSET(ARRAY_LENGTH(parts) - 1)]
      )
    FROM (SELECT SPLIT(text, '.com') AS parts)
  ) AS Landing_Page
FROM (
  SELECT 'www.example.comwww.example.com/path/to/page' AS text UNION ALL
  SELECT 'apply.example.comapply.example.com/eapp/' UNION ALL
  SELECT 'www.example.com/path/to/page'
);

Going one step further, you can extract the logic into a UDF:

CREATE TEMP FUNCTION GetLandingPage(text STRING) AS (
  (
    SELECT 
      CONCAT(
        parts[OFFSET(0)],
        '.com',
        parts[OFFSET(ARRAY_LENGTH(parts) - 1)]
       )
    FROM (SELECT SPLIT(text, '.com') AS parts)
  )
);

SELECT
  GetLandingPage(text) AS Landing_Page
FROM (
  SELECT 'www.example.comwww.example.com/path/to/page' AS text UNION ALL
  SELECT 'apply.example.comapply.example.com/eapp/' UNION ALL
  SELECT 'www.example.com/path/to/page'
);

1 Comment

This is very cool; always up for a UDF solution. Definitely going to tuck this away for more experimentation, thanks!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.