1

I'm trying to split the ad_content with each "_" character, but I don't know why I can't go further than the 9th split word (splits[SAFE_OFFSET(8)] AS objective).

This is the query I'm using:

SELECT
    ad_content,
    splits[SAFE_OFFSET(0)] AS country,
    splits[SAFE_OFFSET(1)] AS product,
    splits[SAFE_OFFSET(2)] AS budget,
    splits[SAFE_OFFSET(3)] AS source,
    splits[SAFE_OFFSET(4)] AS campaign,
    splits[SAFE_OFFSET(5)] AS audience,
    splits[SAFE_OFFSET(6)] AS route_type,
    splits[SAFE_OFFSET(7)] AS business,
    splits[SAFE_OFFSET(8)] AS objective,
    splits[SAFE_OFFSET(9)] AS format,
    splits[SAFE_OFFSET(10)] AS nnn,
    splits[SAFE_OFFSET(11)] AS date,
FROM (
  SELECT
    AD_CONTENT,
    SPLIT(REGEXP_REPLACE(
            AD_CONTENT,
            r'([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_([^_]+)_(.+)',
            r'\1|\2|\3|\4|\5|\6|\7|\8|\9|\10|\11|\12'),
          '|') AS splits
  FROM ga_digital_marketing

For example, ad_content = us_latam_perf_facebook_black-friday_bbdd-push_SCL-CCP_domestic_conversion_push_all_20210906

And this is the results using the query from above:

ad_content country product budget source campaign audience route_type business objective format nnn date
us_latam_perf_facebook_black-friday_bbdd-push_SCL-CCP_domestic_conversion_push_all_20210906 us latam perf facebook black-friday bbdd-push SCL-CCP domestic conversion us0 us1 us2

As you can see above, from the format column (splits[SAFE_OFFSET(9)] AS format) are not giving the results correctly.

I believe that the problem is in here: r'\1|\2|\3|\4|\5|\6|\7|\8|\9|\10|\11|\12') because maybe the number 0 of |\10 is not recognizing it as a number but as a string. And that's why I have as a result us0 us1 and us2

Is there a solution for this limitation?

Is there another way to split the ad_content example?

2
  • Right, \n backreference syntax usually supports groups from 1 to 9. Try $10, $11 and $12. Commented Dec 10, 2021 at 14:18
  • @WiktorStribiżew Thank you for the comment. But it didn't worked. The console shows me this: " Invalid REGEXP_REPLACE pattern: Rewrite schema error: '\' must be followed by a digit or '\' " Commented Dec 10, 2021 at 14:44

1 Answer 1

1

BigQuery's REGEXP_REPLACE supports only \1 to \9 - that is why!

Is there a solution for this limitation?

Use below approach instead

SELECT
    -- ad_content,
    splits[SAFE_OFFSET(0)] AS country,
    splits[SAFE_OFFSET(1)] AS product,
    splits[SAFE_OFFSET(2)] AS budget,
    splits[SAFE_OFFSET(3)] AS source,
    splits[SAFE_OFFSET(4)] AS campaign,
    splits[SAFE_OFFSET(5)] AS audience,
    splits[SAFE_OFFSET(6)] AS route_type,
    splits[SAFE_OFFSET(7)] AS business,
    splits[SAFE_OFFSET(8)] AS objective,
    splits[SAFE_OFFSET(9)] AS format,
    splits[SAFE_OFFSET(10)] AS nnn,
    splits[SAFE_OFFSET(11)] AS date,
FROM (
  SELECT
    AD_CONTENT,
    SPLIT(AD_CONTENT, '_') AS splits
  FROM ga_digital_marketing
)    

if applied to sample in your question - output is

enter image description here

Sign up to request clarification or add additional context in comments.

1 Comment

glad it worked for you. consider also voting up the answer :o)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.