I am trying to solve this issue using regex_replace, but I am wondering if there is a smarter way to solve it, one that saves me from adding more nested regex_replace calls in the future to account for each new scenario.

See the following sqlfiddle for the setup: http://sqlfiddle.com/#!17/82948/12

The main issue I am trying to solve is duplicate "ACK" or "ZEBRA" values, or a combination of them.

Basically, a value shouldn't contain both ZEBRA and ACK, or repeats of either. If it does, keep only the ACK or ZEBRA closest to the number.

  1. ACK_ACK_DOV should be ACK_DOV
  2. ZEBRA_ZEBRA_DOV should be ZEBRA_DOV
  3. ZEBRA_ACK_ACK_DOV should be ACK_DOV
  4. ZEBRA_ZEBRA_ACK_DOV should be ACK_DOV
  5. ZEBRA_393939_DOV should be ZEBRA_393939_DOV
  6. ZEBRA_ZEBRA_29393930 should be ZEBRA_29393930

value                      fixed                  IDEAL
ACK_ACK_DOV_90000          ACK_DOV_90000          ACK_DOV_90000
ACK_910101                 ACK_910101             ACK_910101
ACK_XIS_900000000          ACK_XIS_900000000      ACK_XIS_900000000
GGG_0000000                GGG_0000000            GGG_0000000
ASC_VNA_303930             ASC_VNA_303930         ASC_VNA_303930
ACK_393848489              ACK_393848489          ACK_393848489
ACK_VNA_30303              ACK_VNA_30303          ACK_VNA_30303
ACK_XPM_303030303030       ACK_XPM_303030303030   ACK_XPM_303030303030
ACK_ACK_DOV_39393          ACK_DOV_39393          ACK_DOV_39393
ZEBRA_0393930              ZEBRA_0393930          ZEBRA_0393930
ZEBRA_393939_DOV           ZEBRA_393939_DOV       ZEBRA_393939_DOV
ZEBRA_VNA_3930321          ZEBRA_VNA_3930321      ZEBRA_VNA_3930321
ZEBRA_ACK_ACK_DOV_3934994  ZEBRA_ACK_DOV_3934994  ACK_DOV_3934994
ZEBRA_ZEBRA_29393930       ZEBRA_ZEBRA_29393930   ZEBRA_29393930

Thank you in advance!!

  • great data on input/output expectations. Commented Apr 24, 2021 at 0:21
  • If you really want to do it with a regex, write a JavaScript UDF; that gives you JavaScript's full regex support, so you can use back matching (backreferences). Commented Apr 24, 2021 at 2:00
  • Why does ZEBRA_ACK_ACK_DOV_3934994 translate to ZEBRA_ACK_VOD_3934994? (a) I thought you only wanted one of ZEBRA or ACK in the output, and (b) why is DOV translated to VOD? Commented Apr 24, 2021 at 3:14
  • @Nick you are 100% correct on both of those. I fixed the question so that 1) it's only ZEBRA or ACK, and 2) it should be DOV. I missed it as I was trying to put this together for the post. Commented Apr 26, 2021 at 17:33

1 Answer

Instead of using a regex (there is no back-matching grammar available), turn your logic into token handling: split on underscores, keep a running count of the "bad" tokens (ACK, ZEBRA) from right to left, keep only the good tokens and the last bad one, and glue them back together.

with data(value,fixed,ideal) as (
    select * from values
        ('ACK_ACK_DOV_90000','ACK_DOV_90000','ACK_DOV_90000')
        ,('ACK_910101','ACK_910101','ACK_910101')
        ,('ACK_XIS_900000000','ACK_XIS_900000000','ACK_XIS_900000000')
        ,('GGG_0000000','GGG_0000000','GGG_0000000')
        ,('ASC_VNA_303930','ASC_VNA_303930','ASC_VNA_303930')
        ,('ACK_393848489','ACK_393848489','ACK_393848489')
        ,('ACK_VNA_30303','ACK_VNA_30303','ACK_VNA_30303')
        ,('ACK_XPM_303030303030','ACK_XPM_303030303030','ACK_XPM_303030303030')
        ,('ACK_ACK_DOV_39393','ACK_DOV_39393','ACK_DOV_39393')
        ,('ZEBRA_0393930','ZEBRA_0393930','ZEBRA_0393930')
        ,('ZEBRA_393939_DOV','ZEBRA_393939_DOV','ZEBRA_393939_DOV')
        ,('ZEBRA_VNA_3930321','ZEBRA_VNA_3930321','ZEBRA_VNA_3930321')
        ,('ZEBRA_ACK_ACK_DOV_3934994','ZEBRA_ACK_DOV_3934994','ACK_DOV_3934994')
        ,('ZEBRA_ZEBRA_29393930','ZEBRA_ZEBRA_29393930','ZEBRA_29393930')
)
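select org_value
    ,seq
    -- glue the surviving tokens back together in their original order
    ,array_to_string(array_agg(part) within group (order by index), '_') as output
from (
    select d.value as org_value
        ,f.seq
        ,f.index
        ,f.value as part
        -- flag the tokens that may appear only once per value
        ,case when part='ZEBRA' then 1
            when part='ACK' then 1
            else 0
         end as bad_bit
        -- running count of bad tokens, walking each value from right to left
        ,sum(bad_bit)over(partition by f.seq order by f.index desc) as c
    from data d, table(split_to_table(d.value,'_')) f
)
-- keep everything up to and including the right-most bad token
where c <= 1
group by org_value, seq
order by seq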

gives:

ORG_VALUE                  SEQ  OUTPUT
ACK_ACK_DOV_90000          1    ACK_DOV_90000
ACK_910101                 2    ACK_910101
ACK_XIS_900000000          3    ACK_XIS_900000000
GGG_0000000                4    GGG_0000000
ASC_VNA_303930             5    ASC_VNA_303930
ACK_393848489              6    ACK_393848489
ACK_VNA_30303              7    ACK_VNA_30303
ACK_XPM_303030303030       8    ACK_XPM_303030303030
ACK_ACK_DOV_39393          9    ACK_DOV_39393
ZEBRA_0393930              10   ZEBRA_0393930
ZEBRA_393939_DOV           11   ZEBRA_393939_DOV
ZEBRA_VNA_3930321          12   ZEBRA_VNA_3930321
ZEBRA_ACK_ACK_DOV_3934994  13   ACK_DOV_3934994
ZEBRA_ZEBRA_29393930       14   ZEBRA_29393930
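
If it helps to sanity-check the rule outside Snowflake, the same split / count-from-the-right / keep / join logic can be sketched in a few lines of Python. This is only an illustration and not part of the query above; the names BAD_TOKENS and dedupe_prefix are made up for the example.

```python
# Hypothetical helper mirroring the SQL logic; names are illustrative only.
BAD_TOKENS = {"ACK", "ZEBRA"}  # tokens that may appear at most once per value


def dedupe_prefix(value: str, sep: str = "_") -> str:
    """Keep the right-most bad token and everything after it.

    Values without any bad token pass through unchanged.
    """
    kept = []
    bad_seen = 0
    # Walk right to left, like ORDER BY index DESC in the window SUM.
    for token in reversed(value.split(sep)):
        if token in BAD_TOKENS:
            bad_seen += 1
        # Same filter as WHERE c <= 1 in the query.
        if bad_seen <= 1:
            kept.append(token)
    # Restore the original left-to-right order before joining.
    return sep.join(reversed(kept))


if __name__ == "__main__":
    for sample in ("ZEBRA_ACK_ACK_DOV_3934994", "ZEBRA_ZEBRA_29393930"):
        print(sample, "->", dedupe_prefix(sample))
```

One thing to keep in mind with this rule, in both the SQL and the sketch: anything to the left of the second bad token (counting from the right) is dropped, including tokens that are not ACK or ZEBRA. None of the sample values has that shape, but it is worth checking against real data.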

1 Comment

@analytica I have re-spun it to apply your new logic/explanation.
