
I am trying to extract a value that is nested in a string using Snowflake's REGEXP_SUBSTR().

The values that I want to access are in quotes:

...
Type:
  a:
    - !<string>
     val: "A"
    - !<string>
     val: "B"
    - !<string>
     val: "C"
...

*There is a lot of text above and below this.

I want to extract A, B, and C for all columns. But I am unsure how. I have tried using regexp_substr() but haven't been able to isolate past the first value. I have tried:

REGEXP_SUBSTR(col, 'Type\\W+(\\w+)\\W+\\w.+\\W+\\w.+')

which yields:

Type: a: - !<string> val: "A"

and while that gives the first portion of the string with "A", I just want a way to access "A", "B", and "C" individually.

3
  • Do you want a list of all values, or to be able to access the nth value? Commented Dec 5, 2019 at 8:31
  • You seem to be happy using tokens in your string, eg. ellipsis ... to represent "any text". Are <string> and val also tokens representing something, or are the characters !<string> and val actually present in your data? Commented Dec 6, 2019 at 6:23
  • Yes, there are other !<string> and val in the string. Thank you for pointing that out. I would also like to be able to access the nth value. Commented Dec 6, 2019 at 7:03

3 Answers

1

This SELECT statement will give you what you want ... sorta. Notice that each call looks for a specific occurrence of "val" and then returns the word that follows it.

To my knowledge, REGEXP_SUBSTR returns a single match per call (the first one unless you pass an occurrence number), so once the pattern is found it's done. You may want to look at Snowflake JavaScript stored procedures to see if you can take the example below and iterate through, incrementing the occurrence argument to produce the expected output.

SELECT REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 1, 'e', 1) as A,
       REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 2, 'e', 1) as B,
       REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 3, 'e', 1) as C;
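
If you go the JavaScript route, the iteration could look roughly like this. It is a plain JavaScript sketch (run in Node, not inside a Snowflake stored procedure), and `nthVal` is a hypothetical helper name, not a Snowflake function. It walks the matches of the same pattern and stops at the requested occurrence, which is what the fourth argument of REGEXP_SUBSTR does in the SQL above.

```javascript
// Hypothetical helper: return the n-th (1-based) value captured after "val",
// or null when there are fewer than n matches.
function nthVal(text, n) {
  // Same pattern as the SQL above, written with single backslashes.
  const re = /val\W+(\w+)/g;
  let match;
  let seen = 0;
  // exec() on a global regex resumes after the previous match on each call.
  while ((match = re.exec(text)) !== null) {
    seen += 1;
    if (seen === n) return match[1];
  }
  return null;
}

const sample = 'Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"';
console.log(nthVal(sample, 1), nthVal(sample, 2), nthVal(sample, 3), nthVal(sample, 4));
// prints: A B C null
```

Returning null for a missing occurrence mirrors how the SQL function returns NULL when there is no match.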

0

You have to extract the values in two stages:

  1. Extract the section of the document below Type: a: containing all the val: "data".
  2. Extract the "data" as an array or use REGEXP_SUBSTR() + index n to extract the nth element
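SELECT
  -- Stage 1 pattern: the block of val entries directly below "Type: <key>:"
  'Type:\\s+\\w+:((\\s+- !<string>\\s+val:\\s+"[^"]+")+)' type_section_rx,
  REGEXP_SUBSTR(col, type_section_rx, 1, 1, 'i', 1) vals,
  -- Stage 2: replace the separator in front of each quoted value with ', ' to build a JSON array
  PARSE_JSON('[0' || REPLACE(vals, REGEXP_SUBSTR(vals, '[^"]+'), ', ') || ']') raw_array,
  -- Drop the leading placeholder 0
  ARRAY_SLICE(raw_array, 1, ARRAY_SIZE(raw_array)) val_array,
  val_array[1] B
FROM INPUT_STRING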

The result is an array where you can access the first value with the index [0] etc.
The first regexp can be shortened down to a "least effort" 'Type:\\s+\\w+:(([^"]+"[^"]+")+)'.
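
For comparison, here is the same two-stage idea as a plain JavaScript sketch (run in Node, not in Snowflake). `valsUnderType` is a hypothetical helper name and the sample document, including the "Other" and "Tail" keys, is made up for illustration. Stage 1 isolates the entries directly below `Type: <key>:`, and stage 2 collects the quoted values from that section only.

```javascript
// Hypothetical helper illustrating the two stages; not a Snowflake API.
function valsUnderType(doc) {
  // Stage 1: capture the run of `- !<string> val: "..."` entries below "Type: <key>:".
  const section = /Type:\s+\w+:((?:\s+- !<string>\s+val:\s+"[^"]*")+)/.exec(doc);
  if (section === null) return [];
  // Stage 2: pull every quoted value out of that section and nothing else.
  const values = [];
  const quoted = /"([^"]*)"/g;
  let m;
  while ((m = quoted.exec(section[1])) !== null) values.push(m[1]);
  return values;
}

// Made-up document with unrelated val entries above and below the section.
const doc = [
  'Other:',
  '  val: "ignored"',
  'Type:',
  '  a:',
  '    - !<string>',
  '     val: "A"',
  '    - !<string>',
  '     val: "B"',
  '    - !<string>',
  '     val: "C"',
  'Tail:',
  '  val: "also ignored"',
].join('\n');

const vals = valsUnderType(doc);
console.log(vals);    // [ 'A', 'B', 'C' ]
console.log(vals[1]); // B
```

Scoping the second pass to the captured section is what keeps the other val entries in the surrounding text out of the result.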

0

One more angle -- use JavaScript regex capabilities in a UDF.

For example:

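create or replace function my_regexp(S text)
  returns array
  language javascript
as
$$
  // Match every run of word characters; the g flag returns all matches.
  const re = /(\w+)/g;
  // match() returns null when nothing matches, so fall back to an empty array.
  return S.match(re) || [];
$$
;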

Invoked this way:

set S = '
Type:
  a:
    - !<string>
     val: "A"
    - !<string>
     val: "B"
    - !<string>
     val: "C"
';

select my_regexp($S);

Yields:

[ "Type", "a", "string", "val", "A", "string", "val", "B", "string", "val", "C" ]

Implementing your full regex is a little more work but as you can see, this gets around the single value limitation.
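
Part of that extra work is that `match()` with the `g` flag returns whole matches and drops capture groups, so a pattern such as `val:\s+"([^"]*)"` would hand back `val: "A"` rather than `A`. A minimal plain JavaScript sketch of the difference follows, assuming an ES2020 engine for `matchAll()`. I have not checked whether the Snowflake JavaScript runtime provides it; an `exec()` loop is the older alternative.

```javascript
const text = 'Type:\n  a:\n    - !<string>\n     val: "A"\n    - !<string>\n     val: "B"';
const re = /val:\s+"([^"]*)"/g;

// match() with the g flag returns the full matched strings only.
const whole = text.match(re) || [];

// matchAll() yields one match object per hit, so index 1 is the captured value.
const captured = Array.from(text.matchAll(re), (m) => m[1]);

console.log(whole);    // [ 'val: "A"', 'val: "B"' ]
console.log(captured); // [ 'A', 'B' ]
```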

That said, if performance is your priority, I would expect Snowflake's native regex support to outperform a JavaScript UDF, even though you have to specify the regex multiple times. I haven't tested this, though.

1 Comment

JavaScript RegExp is the real thing as opposed to the cutback version available via SQL functions. The ellipsis [...] in @ making's input most likely represents endless words above and below the excerpt, so you end up extracting the words in the entire document. @ making wants only what follows Type: a:
