4

I have a long string in BigQuery where that I need to extract out some data.

Part of the string looks like this:

... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...

I want to extract out data such as agent, hi, and 0.61254.

I'm trying to use regexp_extract but I can't get the regexp to work correctly:

select
regexp_extract([col],r'score: [0-9]*\.[0-9]+') as score,
regexp_extract([col],r'source: [^"]*') as source
from [table]

What should the regexp be to just get agent or 0.61254 without the field name and no quotation marks?

Thank you in advance.

2 Answers 2

5

You can use

select
regexp_extract([col],r'score:\s*(\d*\.?\d+)') as score,
regexp_extract([col],r'resolved_query:\s*"([^"]*)"') as resolved_query,
regexp_extract([col],r'source:\s*"([^"]*)"') as source
from [table]

Here,

  • score:\s*(\d*\.?\d+) matches score: string, then any zero or more whitespaces, and then there is a capturing group with ID=1 that captures zero or more digits, an optional . and then one or more digits
  • resolved_query:\s*"([^"]*)" matches a resolved_query: string, zero or more whitespaces, ", then captures into Group 1 any zero or more chars other than " and then matches a " char
  • source:\s*"([^"]*)" matches a source: string, zero or more whitespaces, ", then captures into Group 1 any zero or more chars other than " and then matches a " char.
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you. This worked perfectly. And I especially appreciate the explanation of what regex is doing - really helped me understand as it is not intuitive.
1

I love non-trivial approaches - below one of such -

select * except(col) from (
  select col, split(kv, ': ')[offset(0)] key,
    trim(split(kv, ': ')[offset(1)], '"') value,
  from your_table,
  unnest(regexp_extract_all(col, r'\w+: "?[\w.]+"?')) kv
)
pivot (min(value) for key in ('source', 'resolved_query', 'score'))     

if applied to sample data as in your question

with your_table as (
  select '...  source: "agent"  resolved_query: "hi"  score: 0.61254  parameters ... ' col union all
  select '...  source: "agent2"  resolved_query: "hello"  score: 0.12345  parameters ... ' col 
)         

the output is

enter image description here

As you might noticed, the benefit of such approach is obvious - if you have more fields/attributes to extract - you do not need to clone the lines of code for each of attribute - you just add yet another value in last line's list - the whole code is always the same

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.