2

I am looking for a regular expression pattern that removes articles (a, an, the), strips special characters (;, :, %, etc.), and expands abbreviations (inc. -> 'incorporation', & -> 'and', etc.) in Snowflake. I can do most of this in Snowflake, but the result is not completely correct. My code is below. The issue is the article pattern: for example, the output for 'a good book' should be 'good book', but the string 'give a book' should remain unchanged.

select REGEXP_REPLACE(
         REGEXP_REPLACE(
           REGEXP_REPLACE(
             REGEXP_REPLACE(
               concat(' ', lower('a book of the great man'), ' '),
               -- step 1: article patterns anchored at the start of the string
               '(^an )|(^the )|(^a )'),
             -- step 2: special characters
             '\\.|\\,|\\(|\\)|\\!|\\\\|/|£|\\$|%|\\^|\\*|-|\\+|=|_|{|}|\\[|\\]|#|~|;|:|''|`|@|<|>|\\?|¬|\\|'),
           -- step 3: abbreviations, one REGEXP_REPLACE per abbreviation
           ' & ', ' and '),
         ' ltd ', ' limited ');
2
  • So the requirement is to remove articles only from the beginning of the string? That is what I understood from the example you posted. And special characters should be removed wherever they occur? Commented Jan 7, 2022 at 7:26
  • Yes Srinath, that's correct. Commented Jan 7, 2022 at 8:15

2 Answers

4

Instead of nesting REGEXP_REPLACE calls, I suggest writing a UDF (JavaScript or Java) and using that language's regular expressions. It will be much cleaner and easier to maintain.

https://docs.snowflake.com/en/sql-reference/user-defined-functions.html

Here is a sample function:

CREATE OR REPLACE FUNCTION transform_text (STR VARCHAR)
RETURNS VARCHAR
LANGUAGE JAVASCRIPT
AS $$
  // abbreviation -> expansion
  var abbreviations = { "inc.": "incorporation", "&": "and" };

  // remove an article from the beginning of the string
  var Result = STR.replace( /^(a|an|the) /i, "" );

  // remove the special characters
  Result = Result.replace( /[;,:%]/g, "" );

  // expand abbreviations; split/join replaces every occurrence,
  // whereas replace() with a string pattern stops after the first one
  for (var abv in abbreviations) Result = Result.split( abv ).join( abbreviations[abv] );

  return Result;
$$
;

select transform_text( 'A good, a:; bo%ok & hoyd inc.' ) as Result;


+------------------------------------+
|               RESULT               |
+------------------------------------+
| good a book and hoyd incorporation |
+------------------------------------+
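
To try the same three steps outside Snowflake, the body of the UDF can be copied into a plain JavaScript function and run with Node.js. This is only a sketch under that assumption: the name `transformText` is made up for illustration, and it uses split/join so that every occurrence of an abbreviation is expanded.

```javascript
// Hypothetical local stand-in for the UDF body; not part of Snowflake.
function transformText(str) {
  const abbreviations = { "inc.": "incorporation", "&": "and" };

  // 1. remove an article only at the very beginning of the string
  let result = str.replace(/^(a|an|the) /i, "");

  // 2. remove the special characters
  result = result.replace(/[;,:%]/g, "");

  // 3. expand every occurrence of each abbreviation
  for (const abv in abbreviations) {
    result = result.split(abv).join(abbreviations[abv]);
  }
  return result;
}

console.log(transformText("a good book"));   // good book
console.log(transformText("give a book"));   // give a book
console.log(transformText("A good, a:; bo%ok & hoyd inc."));
// good a book and hoyd incorporation
```

The first two calls are the examples from the question: the leading article is dropped, while an article in the middle of the string is kept.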

3

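A couple of tweaks to the excellent answer from Gokhan:

  1. Convert the abbreviations before removing special characters, so that "&" and the "." in "inc." are still present when the abbreviations are looked up.
  2. Special characters are easier to remove with a negated character class ([^...]): delete everything that is not in the allowed set.
  3. Use \b (word boundary) so that the article is matched only as a whole word.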
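
The first two points can be checked with a small standalone JavaScript sketch. The helper names `expandAbbreviations` and `stripSpecial` and the sample input are hypothetical, chosen only to show why the order of the two steps matters once a negated character class is used.

```javascript
// Hypothetical helpers, for illustration only.
const abbreviations = { "inc.": "incorporation", "&": "and" };

// Replace every occurrence of each abbreviation with its expansion.
function expandAbbreviations(text) {
  let result = text;
  for (const abv in abbreviations) {
    result = result.split(abv).join(abbreviations[abv]);
  }
  return result;
}

// Negated character class: delete anything that is not a letter, digit or space.
function stripSpecial(text) {
  return text.replace(/[^A-Za-z0-9 ]/g, "");
}

const input = "Acme & Sons inc.";

// Expand first, then strip: the abbreviations are still intact when looked up.
console.log(stripSpecial(expandAbbreviations(input))); // Acme and Sons incorporation

// Strip first, then expand: "&" and "." are already gone, so nothing is expanded.
console.log(expandAbbreviations(stripSpecial(input))); // Acme  Sons inc
```

Note the double space left behind in the second result, where the "&" used to be.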

CREATE OR REPLACE FUNCTION transform_text_2 (STR VARCHAR)
RETURNS VARCHAR
LANGUAGE JAVASCRIPT
AS $$
  var abbreviations = { "inc.": "incorporation", "&": "and" };

  // remove an article from the beginning of the string;
  // ^ anchors the match so that 'give a book' is left alone
  var Result = STR.replace( /^(an?|the)\b /i, "" );

  // expand abbreviations before the special characters are stripped;
  // split/join replaces every occurrence, not just the first
  for (var abv in abbreviations) Result = Result.split( abv ).join( abbreviations[abv] );

  // remove everything except letters, digits and spaces
  Result = Result.replace( /[^A-Za-z0-9 ]/g, "" );

  return Result;
$$
;
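
Since the requirement is to drop an article only at the beginning of the string, it is worth checking how a word-boundary pattern behaves with and without the ^ anchor. The following is a plain JavaScript sketch for Node.js; the function names are hypothetical and exist only for the comparison.

```javascript
// Word boundaries only: matches the first article anywhere in the string.
function stripFirstArticle(text) {
  return text.replace(/\b(an?|the)\b /i, "");
}

// Anchored with ^: matches an article only at the start of the string.
function stripLeadingArticle(text) {
  return text.replace(/^(an?|the)\b /i, "");
}

console.log(stripFirstArticle("give a book"));     // give book
console.log(stripLeadingArticle("give a book"));   // give a book
console.log(stripLeadingArticle("a good book"));   // good book
console.log(stripLeadingArticle("The great man")); // great man
console.log(stripLeadingArticle("another book"));  // another book
```

The word boundary keeps words such as "another" intact, while the anchor is what leaves 'give a book' unchanged.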
