I am working on a project where I need to extract specific information from URLs, specifically the environment, domain, and hostname. The URLs have variable subdomains, and I'm having difficulty constructing a regex pattern to capture the required groups.

Link: https://regex101.com/r/4DhLns/3

I need help crafting a regex pattern that can efficiently capture the following groups:

  • Group 1: environment (e.g., stage, qa)
  • Group 2: hostname (e.g., hostname)
  • Group 3: domain (e.g., com)

const regex = /.*(?<environment>(qa|stage*)).*\.(?<hostname>\w+)*\.(?<domain>\w+)$/;

function extractInfoFromURL(url) {
    const match = url.match(regex);
    
    if (match) {
        return match.groups;
    } else {
        return null; // URL didn't match the pattern
    }
}

const testUrls = [
    "https://example.test.qa.sub.hostname.com",
    "https://example.test.stage.coonect.hostname.com",
    "https://example.qa.hostname.com",
    "https://example.hostname.com",
    "https://example.stage.hostname.com",
    "https://ops-cert-stage-beta.apps.sub-test.minor.qa.test.sub.hostname.com",
    "https://ops-cert-qa-beta.apps.sub-test.minor.qa.test.sub.hostname.com",
    "https://ops-cert-qa.apps.sub-test.minor.qa.test.sub.hostname.com",
    "https://ops-cert-stage.apps.sub-test.minor.qa.test.sub.hostname.com"
];

testUrls.forEach((url, index) => {
    const result = extractInfoFromURL(url);
    
    if (result) {
        console.log(`Result for URL ${index + 1}:`, result);
    } else {
        console.log(`URL ${url} did not match the pattern.`);
    }
});

The issue is with https://example.hostname.com: no groups are captured for it. Here the environment should be null, while the hostname and domain should still be captured.

Regex101: https://regex101.com/r/aCCWRv/2

  • So what should https://ops-cert-stage-beta.apps.sub-test.minor.qa.test.sub.hostname.com give, "test" or "qa"? Something like this? \bhttps?:\/\/(?<host>\w+(?:-\w+)*).*\.(?<env>qa|stage|dev|preprod|test).*\.(?<domain>\w+)\.[a-z]{2} regex101.com/r/iOk7rH/1 Commented Nov 23, 2023 at 18:02
  • @Thefourthbird, my issue is with example.hostname.com: none of the groups is captured in this case. The env should be null, and the rest of the groups should still be captured. Commented Nov 23, 2023 at 18:24
  • I think it would be easier to specify per link what exactly you want to capture. Commented Nov 23, 2023 at 18:25
  • Like this? regex101.com/r/n7RGxr/1 Commented Nov 23, 2023 at 18:34
  • @Thefourthbird, this is right. I was struggling to make env match zero or one time. Good catch on the delimiter. Commented Nov 23, 2023 at 19:03

2 Answers

The first part .*? can be omitted from the pattern. If the match cannot contain spaces, then .*? could be replaced with \S*?, which matches as few non-whitespace characters as possible.

The named group is already a capture group, so you don't need another separate capture group inside it.
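
As a small illustration of the point above, the sketch below compares a nested group with a flat one. The patterns are reduced versions of the environment part only, and the variable names are made up for this example. Note also that `stage*` in the question makes the final `e` repeatable (it matches "stag", "stage", "stagee", and so on), so plain `stage` is used here.

```javascript
// Reduced from the question: a numbered group nested inside the named group.
const nested = /(?<environment>(qa|stage))/;
// Same alternation with the inner parentheses removed.
const flat = /(?<environment>qa|stage)/;

const nestedMatch = "example.qa.hostname.com".match(nested);
const flatMatch = "example.qa.hostname.com".match(flat);

console.log(nestedMatch.length); // 3: full match, named group, inner group
console.log(flatMatch.length);   // 2: full match, named group
console.log(flatMatch.groups.environment); // qa
```

Both patterns capture the same text; the nested one just carries an extra numbered group that duplicates it.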

If the environment is optional, you can wrap it in an optional non-capturing group that extends up to the point where the "hostname" starts.
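
To show the effect of the optional group in isolation, here is a minimal sketch on bare host strings. The anchored pattern and the variable names are simplifications assumed for this example, not the full pattern for URLs.

```javascript
// Hypothetical reduced pattern: the environment sits inside an optional
// non-capturing group, so the rest can match with or without it.
const optionalEnv = /^(?:(?<environment>qa|stage)\.)?(?<hostname>\w+)\.(?<domain>\w+)$/;

const withEnv = "qa.hostname.com".match(optionalEnv).groups;
const withoutEnv = "hostname.com".match(optionalEnv).groups;

console.log(withEnv.environment);    // qa
console.log(withoutEnv.environment); // undefined
console.log(withoutEnv.hostname, withoutEnv.domain); // hostname com
```

A named group that does not take part in the match is reported as undefined, while the other groups are still filled in.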

The leading \b is a word boundary to prevent a partial word match.
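
A short sketch of what the word boundary prevents. The URL with the "sqa" label is invented for this example, and the environment list is shortened to qa and stage.

```javascript
// Same structure with and without the leading word boundary.
const withBoundary = /\b(?:(?<environment>qa|stage)\S*?\.)?(?<hostname>\w+)\.(?<domain>\w+)$/;
const withoutBoundary = /(?:(?<environment>qa|stage)\S*?\.)?(?<hostname>\w+)\.(?<domain>\w+)$/;

// Made-up URL where "qa" only occurs as the tail of the label "sqa".
const url = "https://example.sqa.hostname.com";

const loose = url.match(withoutBoundary).groups;
const strict = url.match(withBoundary).groups;

console.log(loose.environment);  // qa (picked up from inside "sqa")
console.log(strict.environment); // undefined
console.log(strict.hostname, strict.domain); // hostname com
```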

Currently you are using \w, which may be too restrictive for the characters you want to allow (for example, it does not match a hyphen). You could extend it with a character class [...] that lists all allowed characters.
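
For example, if hyphens should be allowed in the hostname, a character class such as [\w-] could be used. The sketch below assumes a made-up URL with a hyphenated label; which characters to allow is up to your own requirements.

```javascript
// \w matches only letters, digits, and underscore.
const wordOnly = /(?<hostname>\w+)\.(?<domain>\w+)$/;
// Character class extended with a hyphen.
const withHyphen = /(?<hostname>[\w-]+)\.(?<domain>[\w-]+)$/;

// Made-up URL for illustration.
const url = "https://example.my-host.com";

const narrow = url.match(wordOnly).groups;
const wide = url.match(withHyphen).groups;

console.log(narrow.hostname); // host
console.log(wide.hostname);   // my-host
```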

\b(?:(?<environment>qa|stage|dev|preprod)\S*?\.)?(?<hostname>\w+)\.(?<domain>\w+)$

const regex = /\b(?:(?<environment>qa|stage|dev|preprod)\S*?\.)?(?<hostname>\w+)\.(?<domain>\w+)$/;

function extractInfoFromURL(url) {
  const match = url.match(regex);

  if (match) {
    return match.groups;
  } else {
    return null; // URL didn't match the pattern
  }
}

const testUrls = [
  "https://example.test.qa.sub.hostname.com",
  "https://example.test.stage.coonect.hostname.com",
  "https://example.qa.hostname.com",
  "https://example.hostname.com",
  "https://example.stage.hostname.com",
  "https://ops-cert-stage-beta.apps.sub-test.minor.qa.test.sub.hostname.com",
  "https://ops-cert-qa-beta.apps.sub-test.minor.qa.test.sub.hostname.com",
  "https://ops-cert-qa.apps.sub-test.minor.qa.test.sub.hostname.com",
  "https://ops-cert-stage.apps.sub-test.minor.qa.test.sub.hostname.com"
];

testUrls.forEach((url, index) => {
  const result = extractInfoFromURL(url);

  if (result) {
    console.log(`Result for URL ${index + 1}:`, result);
  } else {
    console.log(`URL ${url} did not match the pattern.`);
  }
});
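
The question asks for the environment to be null when it is absent, whereas match.groups reports an unmatched named group as undefined. A minimal sketch of one way to normalize that, assuming a wrapper function named extractInfoOrNull (a name invented here):

```javascript
const regex = /\b(?:(?<environment>qa|stage|dev|preprod)\S*?\.)?(?<hostname>\w+)\.(?<domain>\w+)$/;

// Hypothetical wrapper: same matching as above, but a missing
// environment is returned as null instead of undefined.
function extractInfoOrNull(url) {
  const match = url.match(regex);
  if (!match) return null; // URL didn't match the pattern
  const { environment, hostname, domain } = match.groups;
  return { environment: environment ?? null, hostname, domain };
}

const noEnv = extractInfoOrNull("https://example.hostname.com");
const qaEnv = extractInfoOrNull("https://example.qa.hostname.com");

console.log(noEnv); // { environment: null, hostname: 'hostname', domain: 'com' }
console.log(qaEnv); // { environment: 'qa', hostname: 'hostname', domain: 'com' }
```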


"... I need to extract specific information from URLs, specifically the environment, domain, and hostname. ..."

Try the following capture pattern.

.+?:\/\/(?:(.*(?<=qa|stage).*)|.+?)\.(.+)\.(.+?)$

// One URL per line; the m flag lets $ match at the end of each line.
const s = `https://example.sub.qa.sub.hostname.com
https://example.sub.stage.coonect.hostname.com
https://example.qa.hostname.com
https://example.stage.hostname.com
https://example.hostname.com
https://ops-cert-stage-beta.apps.sub-test.minor.qa.sub.sub.hostname.com
https://ops-cert-qa-beta.apps.sub-test.minor.qa.sub.sub.hostname.com
https://ops-cert-qa.apps.sub-test.minor.qa.sub.sub.hostname.com
https://ops-cert-stage.apps.sub-test.minor.qa.sub.sub.hostname.com`;
const p = /.+?:\/\/(?:(.*(?<=qa|stage).*)|.+?)\.(.+)\.(.+?)$/gm;
// Print the three numbered capture groups for each matched line.
for (const x of s.matchAll(p))
    console.log(x.slice(1, 4));
