Snowflake SQL Regex ~ Extracting Multiple Vals

Question

I am trying to extract values that are nested in a string using Snowflake's REGEXP_SUBSTR().

The values that I want to access are in quotes:

...
Type:
  a:
    - !<string>
     val: "A"
    - !<string>
     val: "B"
    - !<string>
     val: "C"
...

*There is a lot of text above and below this.

I want to extract A, B, and C for all columns. But I am unsure how. I have tried using regexp_substr() but haven't been able to isolate past the first value. I have tried:

REGEXP_SUBSTR(col, 'Type\\W+(\\w+)\\W+\\w.+\\W+\\w.+')

which yields:

Type: a: - !<string> val: "A"

and while that gives the first portion of the string with "A", I just want a way to access "A", "B", and "C" individually.

Do you want a list of all values, or to be able to access the nth value?
– Hans Henrik Eriksen, Commented Dec 5, 2019 at 8:31
You seem to be happy using tokens in your string, e.g. the ellipsis ... to represent "any text". Are <string> and val also tokens representing something, or are the characters !<string> and val actually present in your data?
– Hans Henrik Eriksen, Commented Dec 6, 2019 at 6:23
Yes, there are other !<string> and val in the string. Thank you for pointing that out. I would also like to be able to access the nth value.
– datam, Commented Dec 6, 2019 at 7:03

dbaOnTap (edited by Serkan Arslan) · Accepted Answer · 2019-12-05 05:48:02Z

1

This select statement will give you what you want ... sorta. Notice that it looks for a specific occurrence of "val" and then returns the run of word characters that follows it.

To my knowledge, a single REGEXP_SUBSTR call returns only one match (the occurrence you ask for, or the first by default), so once that match is found it's done. You may want to look at a Snowflake JavaScript stored procedure to see if you can take the example below and iterate, incrementing the occurrence argument to produce the expected output.

SELECT REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 1, 'e', 1) as A,
       REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 2, 'e', 1) as B,
       REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 3, 'e', 1) as C;
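The iteration suggested above could look roughly like the following plain JavaScript sketch. The function name extractVals is hypothetical, the pattern is the same val\W+(\w+) used in the SQL, and whether String.prototype.matchAll is available inside Snowflake's JavaScript runtime has not been verified here, so treat this as an illustration of the idea rather than a ready-made stored procedure.

```javascript
// Hypothetical helper: collect capture group 1 of every `val\W+(\w+)` match, in order.
function extractVals(text) {
  const re = /val\W+(\w+)/g;
  // matchAll yields one match object per occurrence; m[1] is capture group 1.
  return Array.from(text.matchAll(re), (m) => m[1]);
}

// Same sample string as in the SELECT above.
const sample = 'Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"';
const found = extractVals(sample);
console.log(found);     // every occurrence at once
console.log(found[1]);  // second occurrence, comparable to occurrence = 2 in REGEXP_SUBSTR
```

Indexing into the returned array plays the role of the occurrence argument, so the nth value does not need a separate regex call.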

edited Dec 5, 2019 at 5:48

Serkan Arslan

answered Dec 5, 2019 at 5:26

dbaOnTap


Hans Henrik Eriksen · Accepted Answer · 2019-12-06 07:20:15Z

0

You have to extract the values in two stages:

1. Extract the section of the document below Type: a: containing all the val: "data".
2. Extract the "data" as an array, or use REGEXP_SUBSTR() with an occurrence index n to extract the nth element.

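SELECT
  -- Stage 1 pattern: capture the block of val entries under "Type: <key>:"
  'Type:\\s+\\w+:((\\s+- !<string>\\s+val:\\s+"[^"]+")+)' type_section_rx,
  REGEXP_SUBSTR(col, type_section_rx, 1, 1, 'i', 1) vals,
  -- Stage 2: replace the repeated text before each quoted value with ', ' to build a JSON array
  PARSE_JSON('[0' || REPLACE(vals, REGEXP_SUBSTR(vals, '[^"]+'), ', ') || ']') raw_array,
  -- Drop the leading placeholder 0
  ARRAY_SLICE(raw_array, 1, ARRAY_SIZE(raw_array)) val_array,
  val_array[1] B
FROM INPUT_STRING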

The result is an array where you can access the first value with the index [0] etc.
The first regexp can be shortened down to a "least effort" 'Type:\\s+\\w+:(([^"]+"[^"]+")+)'.
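The PARSE_JSON step is the least obvious part of the query, so here is a small JavaScript sketch of the same string manipulation. The section string is an assumed stage-1 result for the question's sample, not real output, and the variable names are made up for illustration.

```javascript
// Assumed stage-1 result for the question's sample (hypothetical, not real query output).
const section =
  '\n    - !<string>\n     val: "A"' +
  '\n    - !<string>\n     val: "B"' +
  '\n    - !<string>\n     val: "C"';

// Everything before the first quote is the text that repeats in front of each value.
const separator = section.match(/[^"]+/)[0];

// Replace every separator with ", " and wrap in "[0 ... ]" so the result is valid JSON.
const jsonText = '[0' + section.split(separator).join(', ') + ']';
const rawArray = JSON.parse(jsonText);

// Drop the placeholder 0, like ARRAY_SLICE(raw_array, 1, ARRAY_SIZE(raw_array)).
const valArray = rawArray.slice(1);
console.log(jsonText);
console.log(valArray, valArray[1]);
```

The trick only works when the text in front of every quoted value is character-for-character identical, including indentation; otherwise the replacement leaves stray text behind and the JSON parse fails.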

edited Dec 6, 2019 at 7:20

answered Dec 5, 2019 at 8:25

Hans Henrik Eriksen


waldente · Accepted Answer · 2019-12-05 18:02:39Z

0

One more angle: use JavaScript regex capabilities in a UDF.

For example:

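create or replace function my_regexp(S text)
  returns array
  language javascript
as
$$
  // Global flag: match() returns every run of word characters, not just the first
  const re = /(\w+)/g;
  // match() returns null when nothing matches, so fall back to an empty array
  return S.match(re) || [];
$$
;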

Invoked this way:

set S = '
Type:
  a:
    - !<string>
     val: "A"
    - !<string>
     val: "B"
    - !<string>
     val: "C"
';

select my_regexp($S);

Yields:

[ "Type", "a", "string", "val", "A", "string", "val", "B", "string", "val", "C" ]

Implementing your full regex is a little more work but as you can see, this gets around the single value limitation.

That said, if performance is your priority, I would expect Snowflake's native regex functions to outperform a JavaScript UDF, even though you have to repeat the regex once per value. I haven't tested this.

answered Dec 5, 2019 at 18:02

waldente

1 Comment

JavaScript RegExp is the real thing, as opposed to the cut-back version available via SQL functions. The ellipsis [...] in @ making's input most likely represents endless words above and below the excerpt, so you end up extracting the words in the entire document. @ making wants only what follows Type: a:
– Hans Henrik Eriksen, Commented over a year ago
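A function body that addresses this point might first isolate the section after Type: a: and only then pull out the quoted values. The sketch below is plain JavaScript; the function name valsAfterType and the sample document are hypothetical, and the section pattern is adapted from the stricter regex in the two-stage answer.

```javascript
// Hypothetical function: return only the quoted values that follow "Type: <key>:".
function valsAfterType(doc) {
  // Stage 1: capture the consecutive "- !<string> / val: "..."" entries under Type.
  const section = doc.match(/Type:\s+\w+:((?:\s+- !<string>\s+val:\s+"[^"]*")+)/);
  if (section === null) return [];
  // Stage 2: extract each quoted value from that section only.
  return Array.from(section[1].matchAll(/"([^"]*)"/g), (m) => m[1]);
}

// Made-up document with quoted text before and after the Type section.
const doc =
  'Header:\n  val: "X"\n' +
  'Type:\n  a:\n' +
  '    - !<string>\n     val: "A"\n' +
  '    - !<string>\n     val: "B"\n' +
  '    - !<string>\n     val: "C"\n' +
  'Other:\n  note: "Z"\n';

console.log(valsAfterType(doc));
```

Because stage 2 runs on the captured section rather than the whole document, quoted text elsewhere (the "X" and "Z" in the sample) is ignored.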
