REGEXP_REPLACE Strings Starting and Ending with Specific Substrings in Snowflake

Question

I am trying to create a column in a view in Snowflake that replaces any string between strings that I care about with nothing.

This is essentially for the purpose of stripping html formatting out of text. As an example:

&lt;ul&gt;
&lt;li&gt;Text I care about 1
&lt;li&gt;Text I care about 2&lt;/li&gt;
&lt;li&gt;Text I care about 3&lt;/li&gt;
&lt;/ul&gt;

Should end up like this:


Text I care about 1
Text I care about 2
Text I care about 3

Based on the patterns I am seeing, I think that if I can eliminate any string starting with &lt, and ending with >, I should be able to achieve the result I am looking for.

In testing on different sites, it seems like the expression REGEXP_REPLACE(originaltext, '&lt.+?>','') should work, but when I try it in Snowflake it cuts off the last 'Text I care about' in some cases, and in other cases shows no results at all. I am not sure whether there is a syntax difference or something else off in the version of regex Snowflake is using, but any advice would be appreciated.

Greg Pavlik · Accepted Answer · 2021-11-02 17:14:47Z

1

Your regular expression works, but it requires lookarounds.

-- Store each line of the example in a session variable.
set sample1 = '&lt;ul&gt;';
set sample2 = '&lt;li&gt;Text I care about 1';
set sample3 = '&lt;li&gt;Text I care about 2&lt;/li&gt;';
set sample4 = '&lt;li&gt;Text I care about 3&lt;/li&gt;';
set sample5 = '&lt;/ul&gt;';

-- regexp_replace2 is a UDF, not the built-in regexp_replace.
select regexp_replace2($SAMPLE1,'&lt.+?&gt;','');
select regexp_replace2($SAMPLE2,'&lt.+?&gt;','');
select regexp_replace2($SAMPLE3,'&lt.+?&gt;','');
select regexp_replace2($SAMPLE4,'&lt.+?&gt;','');
select regexp_replace2($SAMPLE5,'&lt.+?&gt;','');

I wrote a UDF library that supports regular expression lookarounds. It attempts to approximate the built-in Snowflake regular expression functions while supporting lookarounds. The names of the UDFs are the same as the built-in regular expression functions with the suffix "2" as shown in the SQL sample.

https://github.com/GregPavlik/SnowflakeUDFs/tree/main/RegularExpressions
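
For readers who only need this one replacement, a JavaScript UDF is one way to get lazy matching, since JavaScript's regex engine supports .+? directly. The following is a minimal sketch and not code from the library above; the function name strip_escaped_tags is hypothetical.

```sql
-- Hypothetical helper, not part of the linked library.
-- STRICT makes the function return NULL for NULL input without running the JavaScript.
create or replace function strip_escaped_tags(s string)
returns string
language javascript
strict
as
$$
    // Arguments are exposed to JavaScript under their uppercase names.
    // The g flag replaces every match; .+? stops at the first following &gt;
    return S.replace(/&lt;.+?&gt;/g, '');
$$;

select strip_escaped_tags('&lt;li&gt;Text I care about 2&lt;/li&gt;');
```

Note that . does not match a newline in JavaScript, so an escaped tag that spans more than one line would be left in place.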

edited Nov 2, 2021 at 17:14

answered Nov 2, 2021 at 17:07


1 Comment

kanderson Over a year ago

Perfect! I will spend some more time familiarizing myself with lookarounds and what your code is doing, but it does exactly what I was trying to do. Thank you!

Rick · Accepted Answer · 2021-11-02 16:00:20Z

1

Not elegant, but if you know all the special encodings you want to remove, maybe you can just list them, like this:

select REGEXP_REPLACE('&lt;li&gt;Text I care about 3&lt;/li&gt;', '(&lt;)|(li&gt;)|(/li&gt;)','')


Sergiu · Accepted Answer · 2021-11-02 17:19:09Z

0

Your challenge is that you are using a LAZY quantifier (.+?), which Snowflake doesn't support, according to our docs:

Patterns support the full POSIX ERE (Extended Regular Expression) syntax. For details, see the POSIX basic and extended section (in Wikipedia).

The Wikipedia link shows that LAZY is NOT covered by the ERE standard, but is an extension.

In your case you could maybe use a REGEXP_SUBSTR, like this:

SELECT REGEXP_SUBSTR('&lt;li&gt;Text I care about 1&lt;/li&gt;', '(\\w+\\s)+\\d');

with output like:

Text I care about 1

but this requires a specific pattern on your data.
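
If the escaped tags never contain an ampersand of their own, a negated character class may give the same effect as the lazy quantifier while using only ERE syntax. A sketch under that assumption:

```sql
-- [^&]* cannot run past the next '&', so a match can only end at the first &gt; after &lt;
-- Assumes no entity (such as &quot; or &amp;) appears inside a tag.
select regexp_replace('&lt;li&gt;Text I care about 2&lt;/li&gt;', '&lt;[^&]*&gt;', '');
```

Tags whose attributes contain escaped quotes would not match this pattern and would be left in the output.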
