
I have a string with HTML content, and I need to grab all links to .css and .js files. Right now I am using the pattern "(http:.*?.\\.css)" to grab all CSS links, but how can I include .js links, too?

Here is my full code:

List<String> urlList =  new ArrayList<String>();
String str = new String(Files.readAllBytes(FileSystems.getDefault().getPath("c:" + File.separator + "nutchfiles" + File.separator + "test.html")));
Pattern p = Pattern.compile("(http:.*?.\\.css)");
Matcher m = p.matcher(str);

while (m.find()) {
    LOG.info("matched urls" + m.group());
}

1 Answer


If you are looking for a regex fix, here it is:

Pattern p = Pattern.compile("(http:.*?\\.(?:css|js)\\b)");

The alternation will help you match both extensions. See Alternation with The Vertical Bar or Pipe Symbol:

If you want to search for the literal text cat or dog, separate both options with a vertical bar or pipe symbol: cat|dog. If you want more options, simply expand the list: cat|dog|mouse|fish.
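As a minimal sketch of how the alternation might be used to collect both kinds of links, the class below runs the pattern from this answer over a small HTML string. The class name `AssetLinkExtractor`, the method `extract`, and the `example.com` URLs are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
final class AssetLinkExtractor {
    // Same pattern as in the answer: lazily match from "http:" up to the
    // first ".css" or ".js" that is followed by a word boundary.
    private static final Pattern ASSET_URL =
            Pattern.compile("(http:.*?\\.(?:css|js)\\b)");

    // Collects every match of the pattern in the given text.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = ASSET_URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample input; the URLs are placeholders.
        String html = "<link href=\"http://example.com/site.css\">"
                + "<script src=\"http://example.com/app.js\"></script>";
        // prints [http://example.com/site.css, http://example.com/app.js]
        System.out.println(extract(html));
    }
}
```

Note that the pattern only matches URLs that start with `http:`, exactly as in the question, and that the `\\b` keeps `.js` from matching inside an extension such as `.jsp`.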

However, you'd be safer using an HTML parser to extract links from your HTML files.


2 Comments

What about "index.jsp" and "jquery-3.5.1.min.js.js?cv=20231212_000002"? Are these exceptions?
@MrSalesi This post dates from the time when I was just starting to delve deeper into regex. It answers the OP's problem, which was matching more than one extension. If you want to match something more complex, examine the input text you have and check what the expected match boundaries are. If the strings are inside double quotes, you may add [^"]* after \\b. If they run until the first whitespace, add \\S*. If you have a specific problem that you tried to solve and could not, consider asking a new question.
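The two suffixes suggested in this comment could be tried out roughly as follows. This is only a sketch under assumed inputs: the class `BoundaryVariants`, its constants, the `findAll` helper, and the sample strings are hypothetical, and which variant fits depends on how the URLs are delimited in the real HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named here only for illustration.
final class BoundaryVariants {
    // Base pattern from the answer, extended up to the closing double quote.
    static final String QUOTED = "http:.*?\\.(?:css|js)\\b[^\"]*";
    // Base pattern from the answer, extended up to the first whitespace.
    static final String UNTIL_SPACE = "http:.*?\\.(?:css|js)\\b\\S*";

    // Returns every match of the given regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // URL inside a double-quoted attribute, with a query string.
        String attr = "<script src=\"http://example.com/"
                + "jquery-3.5.1.min.js.js?cv=20231212_000002\"></script>";
        System.out.println(findAll(QUOTED, attr));

        // URLs in plain text, separated by whitespace.
        String text = "see http://example.com/app.js?v=2 and "
                + "http://example.com/index.jsp for details";
        System.out.println(findAll(UNTIL_SPACE, text));
    }
}
```

With these sample inputs, the quoted variant keeps the query string after `.js.js`, while the whitespace variant returns only the `app.js?v=2` URL, because `\\b` does not match between `js` and `p` in `index.jsp`.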
