How To Combine These 2 Regexp in Javascript

Question

I wrote a JavaScript routine that, given a hostname or a URL, finds the root domain.

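function getRootDomain(s){
  var sResult = '';
  try {
    // Step 1: skip an optional protocol prefix and capture the host part.
    sResult = s.match(/^(?:.*\:\/?\/)?(?<domain>[\w\-\.]*)/).groups.domain
      // Step 2: keep only the trailing root domain (e.g. test.com or test.co.uk).
      .match(/(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))$/).groups.root;
  } catch(ignore) {} // no match: fall through and return ''
  return sResult;
}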

What is the technique to combine the two regex rules into one rule?

I used this tutorial to try to build on my existing RegExp experience, although I've never really understood lookbehinds and lookaheads (which might be useful here?), and then used the great tool at RegEx101.com for trial and error. I tried replacing what comes after <domain> with what comes after <root>, and variations on that, and all of them failed.

A test set to use with a tool like RegEx101 could be:

https://test.com:8080/?id=4&re=3
https://test-test.com:8080/?id=4&re=3
https://data.test.com:8080/?id=4&re=3
https://data.test.com/?id=4&re=3
https://data.test.com/
https://data.test.com#testing
https://data.test.com/#testing
https://data.test.com:8080/#testing
https://data.test.com:8080#testing
https://data.tester.com/
https://data-test.test.com/
https://test.com
https://test.com#testing
https://test.com/
https://test.am/?id=4
https://test.com?id=3&re=3
https://test.com/?id=3&re=3
https://megatest.com/?id=3&re=3

test.com
data.test.co.uk
test.co
data.test.com
data.tester-test.com
data-test.tester-test.com
tester-test.com
about:blank

Oh, I just noticed that you're the one who posted the answer this was taken from. I thought it was someone else asking how to improve on your answer.
– Barmar, Commented May 2, 2022 at 21:47
I saw your reputation when I was looking at the other answer. Like I said, I didn't notice that it was you posting this one.
– Barmar, Commented May 2, 2022 at 21:56

Volomike · Accepted Answer · 2022-05-05 08:22:08Z

The second regexp uses the $ assertion to only match the end of the .domain capture.

The first RegExp, however, stops matching after the domain (when it meets a /, a ?, a #, a :, or the end of the string if there is no path, query string, or hash part). So you can't just reuse the $ assertion; it would fail in some cases.
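A quick way to see the problem is to keep the $ anchor unchanged in a combined pattern. This is only a sketch: the name naive and the exact pattern are assumptions made for illustration, not something from the question.

```javascript
// Hypothetical naive combination: the $ anchor from the second pattern is kept as-is.
const naive = /^(?:.*?:\/\/)?.*?(?<root>[\w-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))$/;

// A bare hostname ends right after the TLD, so $ still works here.
const bareHost = naive.exec('data.test.com');
console.log(bareHost && bareHost.groups.root); // test.com

// With a trailing path, the string no longer ends at the TLD, so nothing matches.
const withPath = naive.exec('https://data.test.com/');
console.log(withPath); // null
```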

To combine both parts, you could replace the domain capture with this:

.*?(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)

(?:[\/?#:]|$) at the end is a non-capturing group that matches either one of the delimiter characters (/, ?, #, :) or the end of the string.

.*? lazily matches anything. That is, the engine first tries to match the root capture followed by (?:[\/?#:]|$) at the current position. Every time that fails, .*? consumes one more character and the engine tries again, which lets it scan forward to the root.
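To make the scanning visible, the sketch below wraps the lazy prefix in its own named group so you can see what it consumed before the root matched. The group name skipped and the variable names are illustrative assumptions, and the protocol part is left out to keep the focus on the lazy prefix.

```javascript
// Same root pattern as above, with the lazy prefix captured as "skipped"
// purely so its contents can be inspected.
const probe = /^(?<skipped>.*?)(?<root>[\w-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/;

const hosts = ['test.com', 'data.test.com', 'data.test.co.uk', 'data.test.com:8080/#testing'];

// Collect [skipped, root] pairs for each host.
const seen = hosts.map((host) => {
  const { skipped, root } = probe.exec(host).groups;
  return [skipped, root];
});

for (const [skipped, root] of seen) {
  console.log(JSON.stringify(skipped), root);
}
// "" test.com
// "data." test.com
// "data." test.co.uk
// "data." test.com
```

The prefix stays empty when the root starts the string and grows only as far as needed to skip subdomains.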

Also:

you can combine \.\w{3,}|\.\w{2} into just \.\w{2,}.
you can use a non-capturing group around the TLDs ((?:...) instead of (...)).
it would be better to use .*? to match the protocol, or you could end up gobbling too much (with a greedy .*, passing https://example.com/#://bar.com would return bar.com).
you don't need to escape the :. In unicode mode, that escape is actually a syntax error.
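Two of these points are easy to check directly. The sketch below compares a greedy and a lazy protocol match on the tricky URL mentioned above, then compiles an escaped colon in unicode mode. The variable names are assumptions for illustration, and the protocol part is simplified to :// only.

```javascript
// Greedy protocol match (as in the question) vs. lazy protocol match.
const greedy = /^(?:.*:\/\/)?(?<domain>[\w\-.]*)/;
const lazy = /^(?:.*?:\/\/)?(?<domain>[\w\-.]*)/;
const tricky = 'https://example.com/#://bar.com';

const greedyDomain = greedy.exec(tricky).groups.domain;
const lazyDomain = lazy.exec(tricky).groups.domain;
console.log(greedyDomain); // bar.com
console.log(lazyDomain);   // example.com

// In unicode mode, escaping ":" is rejected when the pattern is compiled.
let unicodeError = null;
try {
  new RegExp('\\:', 'u');
} catch (e) {
  unicodeError = e;
}
console.log(unicodeError instanceof SyntaxError); // true
```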

Resulting in:

const x = /^(?:.*?:\/\/)?.*?(?<root>[\w\-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/
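As a rough sketch of how this could slot back into the original routine, the function below uses the combined pattern with a null check instead of try/catch. The constant name ROOT_DOMAIN and the selection of samples are my own choices; the inputs come from the test set in the question.

```javascript
// Combined pattern from the answer; the constant name is illustrative.
const ROOT_DOMAIN = /^(?:.*?:\/\/)?.*?(?<root>[\w\-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/;

// Returns the root domain, or '' when nothing matches.
function getRootDomain(s) {
  const m = ROOT_DOMAIN.exec(s);
  return m ? m.groups.root : '';
}

const samples = [
  'https://test.com:8080/?id=4&re=3',
  'https://data.test.com#testing',
  'https://data-test.test.com/',
  'https://test.am/?id=4',
  'data.test.co.uk',
  'test.co',
  'data-test.tester-test.com',
  'about:blank',
];

for (const s of samples) {
  console.log(s, '->', JSON.stringify(getRootDomain(s)));
}
// https://test.com:8080/?id=4&re=3 -> "test.com"
// https://data.test.com#testing -> "test.com"
// https://data-test.test.com/ -> "test.com"
// https://test.am/?id=4 -> "test.am"
// data.test.co.uk -> "test.co.uk"
// test.co -> "test.co"
// data-test.tester-test.com -> "tester-test.com"
// about:blank -> ""
```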

I actually wrote a RegExp builder that may help you get further in your RegExp learning journey... Here's your RegExp ported to compose-regexp

edited May 5, 2022 at 8:22 by Volomike

answered May 3, 2022 at 19:47 by Pygy

7 Comments

Volomike Over a year ago

This works. You'll want to make a small revision to handle potential port numbers on the domains, such as https://data.test.com:8080/. Here's the change I made: /^(?:.*\:\/?\/)?.*?(?<root>[\w\-]*(\.\w{2,}|\.\w{2}\.\w{2}))(?:[\:\/?#]|$)/ I ran it through regex101.com.

Pygy Over a year ago

Great catch @Volomike, thanks! I've updated the answer accordingly. (And thanks to your upvote, I can at long last comment here, which is sweet :-)

Pygy Over a year ago

@Volomike, I've tweaked the response a bit with further refinements. Hopefully you'll find them helpful.

Volomike Over a year ago

In this part ^(?:.*?:\/\/?)?, why the second-to-last ? ? Shouldn't it be ^(?:.*?:\/\/)? ? See, the ? alone would mean "previous character is optional", when in fact all the characters there are optional -- one might see an https://, http://, ftp://, or perhaps just start with the domain.

Volomike Over a year ago

I went ahead with my editing power and edited your answer to remove that extra ? that was unnecessary.
