Optimal way to extract string in redshift without using regexp

Question

Let's say we want to extract the substring from a URL up to the second occurrence of /.

E.g., for https://abc.def.com/abc?102/ the extracted string should be abc.def.com/abc, without ?102.

For http://abc.def/jkl/ghi/ the extracted string should be abc.def/jkl.

I want to achieve this without using regexp_substr/regexp_replace, which I have already tried.

My observation: REGEXP is quite costly. On a large table of 100+ GB, it will take a lot of time.
– Sayed Awesh Rahman, Commented Jan 27, 2020 at 14:43

GMB · Accepted Answer · 2020-01-27 14:46:17Z


If you specifically want to avoid regexes, you could use split_part() twice:

select split_part(url, '/', 1) || '/' || split_part(url, '/', 2)

I am unsure, however, that this would perform better than a regex-based solution. You would need to benchmark this against your real dataset.
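
Note that for URLs that include a scheme such as https://, as in the question's examples, splitting on '/' yields 'https:' as part 1 and an empty string as part 2, so the host and the first path segment end up in parts 3 and 4. The following is a minimal sketch under that assumption, not a tested production query: the `urls` CTE is hypothetical sample data standing in for the real table, and the inner split_part on '?' is one possible way to drop the query string.

```sql
-- Hypothetical inline sample data; replace the CTE with your own table.
with urls as (
    select 'https://abc.def.com/abc?102/' as url
    union all
    select 'http://abc.def/jkl/ghi/'
)
select
    url,
    -- Splitting on '/': part 1 = scheme ('https:'), part 2 = '' (between the two slashes),
    -- part 3 = host, part 4 = first path segment (possibly with a query string).
    split_part(url, '/', 3)
        || '/'
        || split_part(split_part(url, '/', 4), '?', 1) as extracted
from urls;
```

If the column stores URLs without a scheme, the indexes 1 and 2 from the query above apply instead. A URL with no path segment would produce a trailing '/', so that case may need separate handling.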

edited Jan 27, 2020 at 14:46

answered Jan 27, 2020 at 14:44


2 Comments

GMB Over a year ago

@SayedAweshRahman: as explained in my answer, it is hard to tell beforehand. You would probably need to test it (I'd be actually interested to know which solution performs better).

Sayed Awesh Rahman Over a year ago

Using REGEXP is actually very costly. I benchmarked with 20M rows: REGEXP took 12 min 30 sec, whereas SPLIT_PART took just 2 min 09 sec to return the result.
