BigQuery - remove duplicate substrings from string

Question

I have a string which contains repeated substrings like

The big black dog big black dog is a friendly friendly dog who who lives nearby nearby

How can I remove these in BQ so that the result looks like this

The big black dog is a friendly dog who lives nearby

I tried with regex capturing groups like in this question but no luck so far, it just returns the original phrase

with text as (
  select "The big black dog big black dog is a friendly 
        friendly dog who who lives nearby nearby" as phrase
)

select REGEXP_REPLACE(phrase,r'([ \\w]+)\\1',r'$1') from text

Edit: assuming the repeated word or phrase follows right after the first instance of a word or phrase

i.e. dog dog is considered a repetition of dog, but dog good dog is not

similarly for phrases:

good dog good dog is repetition of good dog

but good dog catch ball good dog is not considered a repetition

assuming the repeated word or phrase follows right after the first instance of a word or phrase — alex
– alex, Commented Oct 23, 2022 at 5:55
so "dog" is repeated only if it follows "dog", to answer your question, i.e. "dog dog" should become "dog" — alex
– alex, Commented Oct 23, 2022 at 5:58
but if it occurs somewhere else in the phrase as in original example it's not considered repeated — alex
– alex, Commented Oct 23, 2022 at 6:00

Jaytiger · Accepted Answer · 2022-10-24 05:03:04Z

2

In addition to @Mikhail's wonderful SQL-ish approach, you might consider below regexp JS UDF approach.

regular expression explanation

CREATE TEMP FUNCTION dedup_repetition(s STRING) RETURNS STRING LANGUAGE js AS
"""
  const re = /(.{2,}) (?=\\1)/g;
  return s.replace(re, '');
""";

WITH sample_data AS (
  SELECT 'The big black dog big black dog is a friendly friendly dog who who lives nearby nearby' phrase UNION ALL
  SELECT 'good dog good dog' UNION ALL
  SELECT 'good dog good dog good dog' UNION ALL
  SELECT 'good dog good dog good' UNION ALL
  SELECT 'dog dog dog' UNION ALL
  SELECT 'good dog catch ball good dog'
)
SELECT dedup_repetition(phrase) deduped_phrase FROM sample_data;

edited Oct 24, 2022 at 5:03

answered Oct 24, 2022 at 1:28

Jaytiger

12.3k2 gold badges7 silver badges16 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Mikhail Berlyant · Accepted Answer · 2022-10-23 19:40:15Z

Consider below naive approach

with recursive candidates as (
  select id, phrase, offset_a, len, string_agg(word_b, ' ' order by offset_b) seq
  from your_table, unnest([0,1,2,3,4,5]) as len,
  unnest(split(phrase, ' ')) word_a with offset offset_a  
  join unnest(split(phrase, ' ')) word_b with offset offset_b
  on offset_b between offset_a and offset_a + len  
  group by id, phrase, offset_a, len
), dups as (
  select id,phrase, seq, row_number() over(partition by phrase order by array_length(split(seq, ' ')) desc) as pos
  from (
    select distinct t1.id, t1.phrase, t1.offset_a, t1.seq
    from candidates t1 
    join candidates t2
    on t1.seq = t2.seq
    and t1.phrase = t2.phrase
    and t1.offset_a = t2.offset_a + t2.len + 1
  )
), iterations as (
  select id, phrase, 1 as pos, phrase || ' ' as dedupped_phrase from your_table
  union all 
  select i.id, i.phrase, pos + 1, regexp_replace(dedupped_phrase, r'(' || seq || ' )+', '' || seq || ' ')
  from iterations i
  join dups d
  using(pos, phrase, id)
) 
select id, phrase, trim(dedupped_phrase) as dedupped_phrase
from iterations
qualify 1 = row_number() over(partition by phrase order by pos desc)

So, assuming your_table is

the output is

Step 1 - CTE named candidates - here you are getting all possible words sequences to then consider in next step. Note: here in unnest([0,1,2,3,4,5]) len you are setting possible length of sequebces (number of words) to consider. you can use unnest(generate_array(0,5)) as len instead if you want to save some typing :o)
Step 2 - CTE named dups identifies duplicate sequences obtained in previous step and ranks them from most lengthy (by words count) to least ones
Step 3- CTE named iterations actually goes and eliminates dups one by one (for every phrase)
Step 4 - finally, the last SELECT retrieves only last iterations for each initial phrase

Collectives™ on Stack Overflow

BigQuery - remove duplicate substrings from string

2 Answers 2

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related