Regex to Extract Environment, Domain, and Hostname from URL with Variable Subdomains

Question

I am working on a project where I need to extract specific information from URLs, specifically the environment, domain, and hostname. The URLs have variable subdomains, and I'm having difficulty constructing a regex pattern to capture the required groups.

Link: https://regex101.com/r/4DhLns/3

I need help crafting a regex pattern that can efficiently capture the following groups:

Group 1: environment (e.g., stage, qa)
Group 2: hostname (e.g., hostname)
Group 3: domain (e.g., com)

const regex = /.*(?<environment>(qa|stage*)).*\.(?<hostname>\w+)*\.(?<domain>\w+)$/;

function extractInfoFromURL(url) {
    const match = url.match(regex);
    
    if (match) {
        return match.groups;
    } else {
        return null; // URL didn't match the pattern
    }
}

const testUrls = [
    "https://example.test.qa.sub.hostname.com",
    "https://example.test.stage.coonect.hostname.com",
    "https://example.qa.hostname.com",
    "https://example.hostname.com",
    "https://example.stage.hostname.com",
    "https://ops-cert-stage-beta.apps.sub-test.minor.qa.test.sub.hostname.com",
    "https://ops-cert-qa-beta.apps.sub-test.minor.qa.test.sub.hostname.com",
    "https://ops-cert-qa.apps.sub-test.minor.qa.test.sub.hostname.com",
    "https://ops-cert-stage.apps.sub-test.minor.qa.test.sub.hostname.com"
];

testUrls.forEach((url, index) => {
    const result = extractInfoFromURL(url);
    
    if (result) {
        console.log(`Result for URL ${index + 1}:`, result);
    } else {
        console.log(`URL ${url} did not match the pattern.`);
    }
});

The problem is with https://example.hostname.com: here the environment should be null, while the hostname and domain should still be captured.

Regex101: https://regex101.com/r/aCCWRv/2

So what should https://ops-cert-stage-beta.apps.sub-test.minor.qa.test.sub.hostname.com give, "test" or "qa"? Something like this? \bhttps?:\/\/(?<host>\w+(?:-\w+)*).*\.(?<env>qa|stage|dev|preprod|test).*\.(?<domain>\w+)\.[a-z]{2} regex101.com/r/iOk7rH/1
– The fourth bird, Commented Nov 23, 2023 at 18:02
@Thefourthbird, my issue is with example.hostname.com: none of the groups is captured in this case. env should be null, and the rest of the groups should still be captured.
– Sumit Ridhal, Commented Nov 23, 2023 at 18:24
I think it would be easier to specify per link what exactly you want to capture.
– The fourth bird, Commented Nov 23, 2023 at 18:25
@Thefourthbird, this is right, I was struggling to make env match zero or one time. Good catch on the delimiter.
– Sumit Ridhal, Commented Nov 23, 2023 at 19:03

The fourth bird · Accepted Answer · 2023-11-23 19:18:19Z

The first part .*? can be omitted from the pattern. If the match cannot contain spaces, then .*? could be \S*?, which matches as few non-whitespace characters as possible.

The named group already is a group, so you don't have to specify another separate capture group inside it.

If the environment is optional, you can wrap it in an optional non-capturing group that extends up to the point where the "hostname" starts.

The leading \b is a word boundary to prevent a partial word match.

Currently you are using \w, which might be too limited to match all allowed characters (it does not match a hyphen, for example). You could extend it with a character class [...] that lists all allowed characters.

\b(?:(?<environment>qa|stage|dev|preprod)\S*?\.)?(?<hostname>\w+)\.(?<domain>\w+)$
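To see what the leading \b contributes, the sketch below compares the pattern above with a variant that drops it. The variant and the "myqa" subdomain are made up for illustration and do not come from the question.

```javascript
// The answer's pattern, and a hypothetical variant without the leading \b.
const withBoundary =
  /\b(?:(?<environment>qa|stage|dev|preprod)\S*?\.)?(?<hostname>\w+)\.(?<domain>\w+)$/;
const withoutBoundary =
  /(?:(?<environment>qa|stage|dev|preprod)\S*?\.)?(?<hostname>\w+)\.(?<domain>\w+)$/;

// "myqa" is an invented subdomain that merely ends in "qa".
const url = "https://example.myqa.hostname.com";

const strict = url.match(withBoundary).groups;
const loose = url.match(withoutBoundary).groups;

console.log(strict.environment); // undefined: "qa" inside "myqa" is not at a word boundary
console.log(loose.environment);  // "qa": the partial word is picked up as an environment
```

In both cases hostname and domain come out as "hostname" and "com"; only the environment group differs.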


const regex = /\b(?:(?<environment>qa|stage|dev|preprod)\S*?\.)?(?<hostname>\w+)\.(?<domain>\w+)$/;

function extractInfoFromURL(url) {
  const match = url.match(regex);

  if (match) {
    return match.groups;
  } else {
    return null; // URL didn't match the pattern
  }
}

const testUrls = [
  "https://example.test.qa.sub.hostname.com",
  "https://example.test.stage.coonect.hostname.com",
  "https://example.qa.hostname.com",
  "https://example.hostname.com",
  "https://example.stage.hostname.com",
  "https://ops-cert-stage-beta.apps.sub-test.minor.qa.test.sub.hostname.com",
  "https://ops-cert-qa-beta.apps.sub-test.minor.qa.test.sub.hostname.com",
  "https://ops-cert-qa.apps.sub-test.minor.qa.test.sub.hostname.com",
  "https://ops-cert-stage.apps.sub-test.minor.qa.test.sub.hostname.com"
];

testUrls.forEach((url, index) => {
  const result = extractInfoFromURL(url);

  if (result) {
    console.log(`Result for URL ${index + 1}:`, result);
  } else {
    console.log(`URL ${url} did not match the pattern.`);
  }
});
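The remark above about \w being limited can be sketched as follows. This is only one possible variant: it assumes hyphens are the only extra character needed in the hostname and domain labels, and the names hyphenAwareRegex and extractWithHyphens, as well as the "my-host" URL, are hypothetical. It also maps a missing environment to null, which is what the question asks for.

```javascript
// Same structure as the pattern above, but hostname and domain use the
// character class [\w-] so that hyphens are allowed (an assumption).
const hyphenAwareRegex =
  /\b(?:(?<environment>qa|stage|dev|preprod)\S*?\.)?(?<hostname>[\w-]+)\.(?<domain>[\w-]+)$/;

function extractWithHyphens(url) {
  const match = url.match(hyphenAwareRegex);
  if (!match) return null; // URL didn't match the pattern

  // A named group that did not participate in the match is undefined;
  // the destructuring default turns that into null.
  const { environment = null, hostname, domain } = match.groups;
  return { environment, hostname, domain };
}

console.log(extractWithHyphens("https://example.qa.my-host.com"));
// { environment: 'qa', hostname: 'my-host', domain: 'com' }
console.log(extractWithHyphens("https://example.hostname.com"));
// { environment: null, hostname: 'hostname', domain: 'com' }
```

With plain \w+ the first URL would yield the hostname "host" and no environment, because \w stops at the hyphen.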

Reilas · Accepted Answer · 2023-11-24 03:16:01Z


"... I need to extract specific information from URLs, specifically the environment, domain, and hostname. ..."

Try the following capture pattern.

.+?:\/\/(?:(.*(?<=qa|stage).*)|.+?)\.(.+)\.(.+?)$

const s = `https://example.sub.qa.sub.hostname.com
https://example.sub.stage.coonect.hostname.com
https://example.qa.hostname.com
https://example.stage.hostname.com
https://example.hostname.com
https://ops-cert-stage-beta.apps.sub-test.minor.qa.sub.sub.hostname.com
https://ops-cert-qa-beta.apps.sub-test.minor.qa.sub.sub.hostname.com
https://ops-cert-qa.apps.sub-test.minor.qa.sub.sub.hostname.com
https://ops-cert-stage.apps.sub-test.minor.qa.sub.sub.hostname.com`;
// The g and m flags let $ match at the end of every line of the string
const p = /.+?:\/\/(?:(.*(?<=qa|stage).*)|.+?)\.(.+)\.(.+?)$/gm;
// Log capture groups 1 to 3 for each URL; group 1 is undefined without qa/stage
for (const x of s.matchAll(p))
    console.log([...x.slice(1, 4)]);

edited Nov 24, 2023 at 3:16

answered Nov 24, 2023 at 2:56
