0

I have a string which contains repeated substrings like

The big black dog big black dog is a friendly friendly dog who who lives nearby nearby

How can I remove these in BQ so that the result looks like this

The big black dog is a friendly dog who lives nearby

I tried with regex capturing groups like in this question but no luck so far, it just returns the original phrase

with text as (
  select "The big black dog big black dog is a friendly 
        friendly dog who who lives nearby nearby" as phrase
)

select REGEXP_REPLACE(phrase,r'([ \\w]+)\\1',r'$1') from text

Edit: assuming the repeated word or phrase follows right after the first instance of a word or phrase

i.e. dog dog is considered a repetition of dog, but dog good dog is not

similarly for phrases:

good dog good dog is repetition of good dog

but good dog catch ball good dog is not considered a repetition

6
  • dog is aslo a repeated substring as well, isn't it ? Commented Oct 23, 2022 at 5:49
  • assuming the repeated word or phrase follows right after the first instance of a word or phrase Commented Oct 23, 2022 at 5:55
  • so "dog" is repeated only if it follows "dog", to answer your question, i.e. "dog dog" should become "dog" Commented Oct 23, 2022 at 5:58
  • but if it occurs somewhere else in the phrase as in original example it's not considered repeated Commented Oct 23, 2022 at 6:00
  • added example to the question Commented Oct 23, 2022 at 6:06

2 Answers 2

2

In addition to @Mikhail's wonderful SQL-ish approach, you might consider below regexp JS UDF approach.

regular expression explanation

CREATE TEMP FUNCTION dedup_repetition(s STRING) RETURNS STRING LANGUAGE js AS
"""
  const re = /(.{2,}) (?=\\1)/g;
  return s.replace(re, '');
""";

WITH sample_data AS (
  SELECT 'The big black dog big black dog is a friendly friendly dog who who lives nearby nearby' phrase UNION ALL
  SELECT 'good dog good dog' UNION ALL
  SELECT 'good dog good dog good dog' UNION ALL
  SELECT 'good dog good dog good' UNION ALL
  SELECT 'dog dog dog' UNION ALL
  SELECT 'good dog catch ball good dog'
)
SELECT dedup_repetition(phrase) deduped_phrase FROM sample_data;

enter image description here

Sign up to request clarification or add additional context in comments.

Comments

2

Consider below naive approach

with recursive candidates as (
  select id, phrase, offset_a, len, string_agg(word_b, ' ' order by offset_b) seq
  from your_table, unnest([0,1,2,3,4,5]) as len,
  unnest(split(phrase, ' ')) word_a with offset offset_a  
  join unnest(split(phrase, ' ')) word_b with offset offset_b
  on offset_b between offset_a and offset_a + len  
  group by id, phrase, offset_a, len
), dups as (
  select id,phrase, seq, row_number() over(partition by phrase order by array_length(split(seq, ' ')) desc) as pos
  from (
    select distinct t1.id, t1.phrase, t1.offset_a, t1.seq
    from candidates t1 
    join candidates t2
    on t1.seq = t2.seq
    and t1.phrase = t2.phrase
    and t1.offset_a = t2.offset_a + t2.len + 1
  )
), iterations as (
  select id, phrase, 1 as pos, phrase || ' ' as dedupped_phrase from your_table
  union all 
  select i.id, i.phrase, pos + 1, regexp_replace(dedupped_phrase, r'(' || seq || ' )+', '' || seq || ' ')
  from iterations i
  join dups d
  using(pos, phrase, id)
) 
select id, phrase, trim(dedupped_phrase) as dedupped_phrase
from iterations
qualify 1 = row_number() over(partition by phrase order by pos desc)

So, assuming your_table is

enter image description here

the output is

enter image description here

  • Step 1 - CTE named candidates - here you are getting all possible words sequences to then consider in next step. Note: here in unnest([0,1,2,3,4,5]) len you are setting possible length of sequebces (number of words) to consider. you can use unnest(generate_array(0,5)) as len instead if you want to save some typing :o)
  • Step 2 - CTE named dups identifies duplicate sequences obtained in previous step and ranks them from most lengthy (by words count) to least ones
  • Step 3- CTE named iterations actually goes and eliminates dups one by one (for every phrase)
  • Step 4 - finally, the last SELECT retrieves only last iterations for each initial phrase

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.