2

I am trying to create a column in a view in Snowflake that replaces any string between strings that I care about with nothing.

This is essentially for the purpose of stripping html formatting out of text. As an example:

<ul>
<li>Text I care about 1
<li>Text I care about 2</li>
<li>Text I care about 3</li>
</ul>

Would should end up like this:


Text I care about 1
Text I care about 2
Text I care about 3

Based on the patterns I am seeing, I think that if I can eliminate any string starting with &lt, and ending with >, I should be able to achieve the result I am looking for.

In testing on different sites it seems like expression REGEXP_REPLACE(originaltext, '&lt.+?>','') should, work, but when attempting in Snowflake it seems to be cutting off the last 'Text I care about' in some cases, and in other cases just is not showing any results at all. I am not sure if there is a syntax difference or something else off in the version of regex snowflake is using, but any advice would be appreciated.

3 Answers 3

1

Your regular expression works, but it requires lookarounds.

set sample1 = '<ul>';
set sample2 = '<li>Text I care about 1';
set sample3 = '<li>Text I care about 2</li>';
set sample4 = '<li>Text I care about 3</li>';
set sample5 = '</ul>';

select regexp_replace2($SAMPLE1,'&lt.+?>','');  
select regexp_replace2($SAMPLE2,'&lt.+?>','');
select regexp_replace2($SAMPLE3,'&lt.+?>','');
select regexp_replace2($SAMPLE4,'&lt.+?>','');
select regexp_replace2($SAMPLE5,'&lt.+?>','');

I wrote a UDF library that supports regular expression lookarounds. It attempts to approximate the built-in Snowflake regular expression functions while supporting lookarounds. The names of the UDFs are the same as the built-in regular expression functions with the suffix "2" as shown in the SQL sample.

https://github.com/GregPavlik/SnowflakeUDFs/tree/main/RegularExpressions

Sign up to request clarification or add additional context in comments.

1 Comment

Perfect! I will spend some more time familiarizing myself with lookarounds and what your code is doing, but it does exactly what I was trying to do. Thank you!
1

Not elegant, but if you know all the special encodings you want to remove, maybe you can just list them like that?

select REGEXP_REPLACE('<li>Text I care about 3</li>', '(<)|(li>)|(/li>)','')

Comments

0

Your challenge is the fact that you are using a LAZY quantifier (.+?) and Snowflake doesn't supports it as according to our docs:

Patterns support the full POSIX ERE (Extended Regular Expression) syntax. For details, see the POSIX basic and extendedsection (in Wikipedia).

The Wikipedia link shows that LAZY is NOT covered by the ERE standard, but is it an extension.

In your case you could maybe use a REGEXP_SUBSTR, like this:

SELECT REGEXP_SUBSTR('<li>Text I care about 1</li>', '(\\w+\\s)+\\d');

with output like:

Text I care about 1

but this requires a specific pattern on your data.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.