So far I have tried preg_match_all with various expressions, but I am not good at regex.

I have a string (a downloaded HTML page) that contains a lot of things, including asset URLs. I need to extract all valid domains and IPv4 addresses from this string.

If a regular expression can do it, I would also like to strip the path and query string from each match. If that is not possible, I can remove them in later processing.

This expression works reasonably well for domains, but it could be better: it also catches garbage like "/html/style/global.css.php", and it does not match IP addresses:

preg_match_all('#[-a-zA-Z0-9@:%_\+.~\#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~\#?&//=]*)?#si', $response->body(), $match);
  • "I need" is not a question. Please show what you tried and we'll help you fix it, we won't write it for you. Try googling "domain name regular expression" Commented Oct 11, 2023 at 21:55
  • I added my current expression Commented Oct 11, 2023 at 22:21
  • Why {2,256}? x.com is a valid domain. Commented Oct 11, 2023 at 22:27
  • You have lots of characters that aren't allowed in domain names. They can only be letters, digits, "-" and ".", not @, ~, &, etc. Commented Oct 11, 2023 at 22:28
  • I would just like to extract all potential domains and ipv4 addresses from the string. It doesn't matter whether the domain exists, it simply should match the pattern Commented Oct 12, 2023 at 8:29
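
Putting the comments together, one possible approach is sketched below. It is not a definitive solution: the function name extractHosts and the filtering rules are assumptions made for illustration. The idea is to collect every run of characters that may appear in a hostname (letters, digits, "-" and "."), skip runs that directly follow a single "/" (a path segment such as /html/style/global.css.php), and then validate each remaining candidate as either an IPv4 address or a syntactically plausible domain. Because only hostname characters are collected, the path and query string never become part of a match.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not from the original post): returns the unique
 * domain names and IPv4 addresses found in $text, in order of appearance.
 */
function extractHosts(string $text): array
{
    // Collect every maximal run of hostname characters, with byte offsets.
    preg_match_all('/[a-z0-9.-]+/i', $text, $m, PREG_OFFSET_CAPTURE);

    $hosts = [];
    foreach ($m[0] as [$run, $offset]) {
        // A single "/" right before the run means a path segment;
        // "//" marks the start of a URL authority and is allowed.
        $prev  = $offset >= 1 ? $text[$offset - 1] : '';
        $prev2 = $offset >= 2 ? $text[$offset - 2] : '';
        if ($prev === '/' && $prev2 !== '/') {
            continue;
        }

        // Drop stray leading/trailing dots and hyphens (e.g. a sentence-ending ".").
        $host = strtolower(trim($run, '.-'));

        // IPv4: let PHP's built-in filter check the 0-255 octet ranges.
        if (filter_var($host, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
            $hosts[$host] = true;
            continue;
        }

        // Domain: dot-separated labels of 1-63 characters that do not start
        // or end with "-", followed by an alphabetic TLD of 2+ letters.
        $domain = '/^(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$/';
        if (strlen($host) <= 253 && preg_match($domain, $host) === 1) {
            $hosts[$host] = true;
        }
    }

    // Array keys give de-duplication while preserving first-seen order.
    return array_keys($hosts);
}
```

Calling extractHosts($response->body()) would return a flat array of strings. One caveat: a relative file name that is not preceded by "/" (for example href="style.css") is syntactically indistinguishable from a domain, so it would still be reported. Rejecting those would need an extra check, such as comparing the last label against a list of known TLDs.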
