I wrote a JavaScript routine that, given a hostname or a URL, finds the root domain.

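function getRootDomain(s){
  var sResult = '';
  try {
    // First match: skip an optional protocol and capture the host name.
    // Second match: take the root domain from the end of that host name.
    sResult = s.match(/^(?:.*\:\/?\/)?(?<domain>[\w\-\.]*)/).groups.domain
      .match(/(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))$/).groups.root;
  } catch(ignore) {} // no match: fall through and return ''
  return sResult;
}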

What is the technique to combine the two regex rules into one rule?

I used this tutorial to try to advance my existing RegExp experience over the years, although I've never really understood lookbehinds and lookaheads (which might be useful here?), and then used the great tool at RegEx101.com for trial and error. What I tried was to stick what's after <root> to replace what comes after <domain>, and variations on that, and all failed.

A test set to use with a tool like RegEx101 could be:

https://test.com:8080/?id=4&re=3
https://test-test.com:8080/?id=4&re=3
https://data.test.com:8080/?id=4&re=3
https://data.test.com/?id=4&re=3
https://data.test.com/
https://data.test.com#testing
https://data.test.com/#testing
https://data.test.com:8080/#testing
https://data.test.com:8080#testing
https://data.tester.com/
https://data-test.test.com/
https://test.com
https://test.com#testing
https://test.com/
https://test.am/?id=4
https://test.com?id=3&re=3
https://test.com/?id=3&re=3
https://megatest.com/?id=3&re=3

test.com
data.test.co.uk
test.co
data.test.com
data.tester-test.com
data-test.tester-test.com
tester-test.com
about:blank
  • Oh, I just noticed that you're the one who posted the answer this was taken from. I thought it was someone else asking how to improve on your answer. Commented May 2, 2022 at 21:47
  • I saw your reputation when I was looking at the other answer. Like I said, I didn't notice that it was you posting this one. Commented May 2, 2022 at 21:56

1 Answer

The second regexp uses the $ assertion to only match the end of the .domain capture.

The first RegExp, however, stops matching after the domain: when it meets a /, a ?, a # or a :, or the end of the string if there is no port, path, query string or hash part. So you can't just reuse the $ assertion; it would fail whenever anything follows the domain.
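To see why the $ assertion can't simply be carried over, here is a small sketch that applies the question's second pattern directly to a bare host name and to a full URL. The variable names and the two sample inputs are only illustrative.

```javascript
// The second pattern from the question: it anchors the root at the end of the input.
const rootAtEnd = /(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))$/;

// On a bare host name, the end of the string is also the end of the domain.
const bare = 'data.test.com'.match(rootAtEnd);
console.log(bare && bare.groups.root); // test.com

// With a port and a path, no ".xx" suffix sits at the very end, so nothing matches.
const withPort = 'https://data.test.com:8080/'.match(rootAtEnd);
console.log(withPort); // null
```

That is the case the first match in the original function was protecting against, by cutting the input down to the host name before $ was applied.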

To combine both parts, you could replace the domain capture with this:

.*?(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)

(?:[\/?#:]|$) at the end is a non-capturing group that matches either one of the terminating characters (/, ?, # or :) or the end of the string.

.*? lazily (non-greedily) matches anything. That is, the engine first tries to match the root capture followed by (?:[\/?#:]|$) at the current position. Every time that fails, .*? consumes one more character and the engine tries again, which lets the pattern search forward for the root.
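A quick way to watch the lazy prefix at work is to wrap it in a named group and look at what it ended up consuming. This is only a sketch: the `skipped` group name and the `probe` variable are made up for the demonstration, and the pattern is anchored with ^ on its own rather than preceded by the protocol part.

```javascript
// The replacement pattern, with the lazy prefix captured as "skipped" (hypothetical name)
// so that the text .*? had to consume before the root matched can be inspected.
const probe = /^(?<skipped>.*?)(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/;

const probed = 'data.test.com:8080/'.match(probe);
console.log(probed.groups.skipped); // data.
console.log(probed.groups.root);    // test.com
```

The root can't match at the start of the string, because data.test is followed by a dot rather than by a terminator, so the prefix grows one character at a time until test.com followed by : fits.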

Also:

  • you can combine \.\w{3,}|\.\w{2} into just \.\w{2,}.

  • you can use a non-capturing group around the TLDs ((?:...) instead of (...)).

  • It would be better to use .*? to get the protocol, or you could end up globbing too much (with a greedy .*, passing https://example.com/#://bar.com would return bar.com).

  • You don't need to escape the :. In unicode mode, that escape is actually a syntax error.

Resulting in:

const x = /^(?:.*?:\/\/)?.*?(?<root>[\w\-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/
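As a rough check, the combined pattern can be dropped into a function shaped like the one in the question and run against a few entries from the test set. The constant names `ROOT_RE` and `GREEDY_RE` are my own, and the greedy variant exists only to illustrate the protocol pitfall and the escaping note from the list above.

```javascript
// The combined pattern, wrapped the same way as the question's function.
const ROOT_RE = /^(?:.*?:\/\/)?.*?(?<root>[\w\-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/;

function getRootDomain(s) {
  const m = String(s).match(ROOT_RE);
  return m ? m.groups.root : ''; // '' when nothing looks like a domain
}

// A few inputs taken from the question's test set.
const samples = [
  'https://data.test.com:8080/?id=4&re=3',
  'https://test.com#testing',
  'https://test.am/?id=4',
  'data.test.co.uk',
  'about:blank',
];
for (const s of samples) console.log(s, '->', JSON.stringify(getRootDomain(s)));

// Same pattern with a greedy protocol part, to show the over-matching problem.
const GREEDY_RE = /^(?:.*:\/\/)?.*?(?<root>[\w\-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/;
const tricky = 'https://example.com/#://bar.com';
console.log(getRootDomain(tricky));               // example.com
console.log(tricky.match(GREEDY_RE).groups.root); // bar.com

// Escaping ":" is harmless without the u flag but is rejected in unicode mode.
let unicodeEscapeError = null;
try {
  new RegExp('\\:', 'u');
} catch (e) {
  unicodeEscapeError = e;
}
console.log(unicodeEscapeError instanceof SyntaxError); // true
```

Returning '' for a failed match mirrors the try/catch fallback in the original function, so about:blank still yields an empty string.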

I actually wrote a RegExp builder that may help you get further in your RegExp learning journey... Here's your RegExp ported to compose-regexp

7 Comments

This works. You'll want to make a small revision to handle potential port numbers on the domains such as https://data.test.com:8080/. Here's the change I made: /^(?:.*\:\/?\/)?.*?(?<root>[\w\-]*(\.\w{2,}|\.\w{2}\.\w{2}))(?:[\:\/?#]|$)/ I ran through regex101.com.
Great catch @Volomike, thanks! I've updated the answer accordingly. (And thanks to your upvote, I can at long last comment here, which is sweet :-)
@volomike, I've tweaked the response a bit with further refinements. Hopefully you'll find them helpful
In this part ^(?:.*?:\/\/?)?, why the second-to-last ? ? Shouldn't it be ^(?:.*?:\/\/)? ? See, the ? alone would mean "previous character is optional", when in fact all the characters there are optional -- one might see an https://, http://, ftp://, or perhaps just start with the domain.
I went ahead with my editing power and edited your answer to remove that extra ? that was unnecessary.