Remove overlapping substrings within a BigQuery STRING field

Question

I'm trying to find the most efficient way to remove overlapping substrings from a string field value on BigQuery. My use case is the same as Combining multiple regex substitutions but within BigQuery.

If I sum up the post above:

With the following list of substrings: ["quick brown fox", "fox jumps"]

I want:

A quick brown fox jumps over the lazy dog to be replaced by A over the lazy dog.

My thoughts were to come up with a JS UDF that does a similar job than what's mentioned in the post above i.e. to create a mask of the whole string and loop over the substrings to identify which characters to remove... But do you have better ideas?

Thanks for your help

So the algorithm is basically: Find all substrings' start and end position, combine overlapping positions and remove accordingly, right? — Martin Weitzmann
– Martin Weitzmann, Commented Apr 22, 2020 at 11:35
Do you have a fixed list of sub-strings you want to remove from the strings in your column? I believe you can definitely use UDF for this, although depending on the use case you can also use REPLACE() method in StandardSQL. — Alexandre Moraes
– Alexandre Moraes, Commented Apr 22, 2020 at 12:33
@MartinWeitzmann Yes exactly. I've tried a JS implementation, it seems to work. — MissouPower
– MissouPower, Commented Apr 22, 2020 at 13:51
@AlexandreMoraes my use case what a frozen list of keywords and I couldn't find out how to do this in Standard SQL. — MissouPower
– MissouPower, Commented Apr 22, 2020 at 14:00

Mikhail Berlyant · Accepted Answer · 2020-04-23 00:27:48Z

2

I couldn't find out how to do this in Standard SQL

Below is for BigQuery Standard SQL and does whole thing in one shot - just one [simple] query

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'A quick brown fox jumps over the lazy dog' text 
), list AS (
  SELECT ['quick brown fox', 'fox jumps'] phrases
)
SELECT text AS original_text, REGEXP_REPLACE(text, STRING_AGG(pattern, '|'), '') processed_text FROM (
 SELECT DISTINCT text, SUBSTR(text, MIN(start),  MAX(finish) - MIN(start) + 1) pattern FROM (
  SELECT *, COUNTIF(flag) OVER(PARTITION BY text ORDER BY start) grp FROM (
   SELECT *, start > LAG(finish) OVER(PARTITION BY text ORDER BY start) flag FROM (
    SELECT *, start + phrase_len - 1 AS finish FROM (
     SELECT *, LENGTH(cut) + 1 + OFFSET * phrase_len + IFNULL(SUM(LENGTH(cut)) OVER(win), 0) start
     FROM `project.dataset.table`, list, 
     UNNEST(phrases) phrase, 
     UNNEST([LENGTH(phrase)]) phrase_len,
     UNNEST(REGEXP_EXTRACT_ALL(text, r'(.+?)' || phrase)) cut WITH OFFSET
     WINDOW win AS (PARTITION BY text, phrase ORDER BY OFFSET ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) 
 )))) GROUP BY text, grp
) GROUP BY text

with output

Row original_text                               processed_text   
1   A quick brown fox jumps over the lazy dog   A over the lazy dog

I tested above with few more complex / tricky texts and it still worked

Brief explanation:

gather all inclusions of phrases in list and their respective starts and ends
combine overlapping fragments and calculate their respective starts and ends
extract new fragments based on starts and end from above step 2
order DESC them by length and generate regexp expression
finally do REGEXP_REPLACE using regexp generated in above step 4

Above might look messy - but in reality it does all above in one query and in pure SQL

edited Apr 23, 2020 at 0:27

answered Apr 22, 2020 at 23:45

Mikhail Berlyant

174k10 gold badges173 silver badges251 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

MissouPower Over a year ago

Thanks a lot. I'll try this against the JS udf asap! Impressive SQL!

Mikhail Berlyant Over a year ago

great! share results then

MissouPower Over a year ago

Small caveat that I have, contrary to the js below, the banana test isn't covered by this standardSQL piece: if my sentence is banana and one of my keywords ana, this code will give me bna instead of b.

Mikhail Berlyant Over a year ago

Confirming - this particular (and similar) cases - not supported here. specifically REGEXP_EXTRACT_ALL does not extract two entries of ana from banana but rather only one - this is a limitation of BigQuery's implementation of regexp. from documentation - The REGEXP_EXTRACT_ALL function only returns non-overlapping matches

MissouPower · Accepted Answer · 2020-04-22 18:40:21Z

1

Using a custom JS UDF seems to work, but i've seen faster BigQuery..!

    CREATE FUNCTION `myproject.mydataset.keyword_remover_js`(label STRING) RETURNS STRING LANGUAGE js AS """

    var keywords = ["a quick brown fox", "fox jumps"] ;

    var mask = new Array(label.length).fill(1);
    var reg = new RegExp("(" + keywords.join("|") + ")", 'g');
    var found;

    while (found = reg.exec(label)) { 

        for (var i = found.index; i < reg.lastIndex; i++) {
            mask[i] = 0;
            }
        reg.lastIndex = found.index+1;
    }

    var result = []

    for (var i = 0; i < label.length; i++) {
        if (mask[i]) {
            result.push(label[i])
        }
    }

    return result.join('').replace(/ +/g,' ').replace(/^ +| +$/,'')
    """;

edited Apr 22, 2020 at 18:40

answered Apr 22, 2020 at 13:58

MissouPower

957 bronze badges

Collectives™ on Stack Overflow

Remove overlapping substrings within a BigQuery STRING field

2 Answers 2

4 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related