2

I'm trying to find the most efficient way to remove overlapping substrings from a string field value on BigQuery. My use case is the same as Combining multiple regex substitutions but within BigQuery.

If I sum up the post above:

With the following list of substrings: ["quick brown fox", "fox jumps"]

I want:

A quick brown fox jumps over the lazy dog to be replaced by A over the lazy dog.

My thoughts were to come up with a JS UDF that does a similar job than what's mentioned in the post above i.e. to create a mask of the whole string and loop over the substrings to identify which characters to remove... But do you have better ideas?

Thanks for your help

4
  • So the algorithm is basically: Find all substrings' start and end position, combine overlapping positions and remove accordingly, right? Commented Apr 22, 2020 at 11:35
  • Do you have a fixed list of sub-strings you want to remove from the strings in your column? I believe you can definitely use UDF for this, although depending on the use case you can also use REPLACE() method in StandardSQL. Commented Apr 22, 2020 at 12:33
  • @MartinWeitzmann Yes exactly. I've tried a JS implementation, it seems to work. Commented Apr 22, 2020 at 13:51
  • @AlexandreMoraes my use case what a frozen list of keywords and I couldn't find out how to do this in Standard SQL. Commented Apr 22, 2020 at 14:00

2 Answers 2

2

I couldn't find out how to do this in Standard SQL

Below is for BigQuery Standard SQL and does whole thing in one shot - just one [simple] query

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'A quick brown fox jumps over the lazy dog' text 
), list AS (
  SELECT ['quick brown fox', 'fox jumps'] phrases
)
SELECT text AS original_text, REGEXP_REPLACE(text, STRING_AGG(pattern, '|'), '') processed_text FROM (
 SELECT DISTINCT text, SUBSTR(text, MIN(start),  MAX(finish) - MIN(start) + 1) pattern FROM (
  SELECT *, COUNTIF(flag) OVER(PARTITION BY text ORDER BY start) grp FROM (
   SELECT *, start > LAG(finish) OVER(PARTITION BY text ORDER BY start) flag FROM (
    SELECT *, start + phrase_len - 1 AS finish FROM (
     SELECT *, LENGTH(cut) + 1 + OFFSET * phrase_len + IFNULL(SUM(LENGTH(cut)) OVER(win), 0) start
     FROM `project.dataset.table`, list, 
     UNNEST(phrases) phrase, 
     UNNEST([LENGTH(phrase)]) phrase_len,
     UNNEST(REGEXP_EXTRACT_ALL(text, r'(.+?)' || phrase)) cut WITH OFFSET
     WINDOW win AS (PARTITION BY text, phrase ORDER BY OFFSET ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) 
 )))) GROUP BY text, grp
) GROUP BY text

with output

Row original_text                               processed_text   
1   A quick brown fox jumps over the lazy dog   A over the lazy dog    

I tested above with few more complex / tricky texts and it still worked

Brief explanation:

  1. gather all inclusions of phrases in list and their respective starts and ends
  2. combine overlapping fragments and calculate their respective starts and ends
  3. extract new fragments based on starts and end from above step 2
  4. order DESC them by length and generate regexp expression
  5. finally do REGEXP_REPLACE using regexp generated in above step 4

Above might look messy - but in reality it does all above in one query and in pure SQL

Sign up to request clarification or add additional context in comments.

4 Comments

Thanks a lot. I'll try this against the JS udf asap! Impressive SQL!
great! share results then
Small caveat that I have, contrary to the js below, the banana test isn't covered by this standardSQL piece: if my sentence is banana and one of my keywords ana, this code will give me bna instead of b.
Confirming - this particular (and similar) cases - not supported here. specifically REGEXP_EXTRACT_ALL does not extract two entries of ana from banana but rather only one - this is a limitation of BigQuery's implementation of regexp. from documentation - The REGEXP_EXTRACT_ALL function only returns non-overlapping matches
1

Using a custom JS UDF seems to work, but i've seen faster BigQuery..!

    CREATE FUNCTION `myproject.mydataset.keyword_remover_js`(label STRING) RETURNS STRING LANGUAGE js AS """

    var keywords = ["a quick brown fox", "fox jumps"] ;

    var mask = new Array(label.length).fill(1);
    var reg = new RegExp("(" + keywords.join("|") + ")", 'g');
    var found;

    while (found = reg.exec(label)) { 

        for (var i = found.index; i < reg.lastIndex; i++) {
            mask[i] = 0;
            }
        reg.lastIndex = found.index+1;
    }

    var result = []

    for (var i = 0; i < label.length; i++) {
        if (mask[i]) {
            result.push(label[i])
        }
    }

    return result.join('').replace(/ +/g,' ').replace(/^ +| +$/,'')
    """;

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.