I am trying to extract the page title from HTML and XML pages. This is the regular expression I use:

Pattern p = Pattern.compile(".*<head>.*<title>(.*)</title>.*</head>.*");

The problem is that it only extracts the title from HTML files and gives me null for XML files. Can anyone help me change the regex so that it gets the XML page titles as well?

Code:

content= stringBuilder.toString(); // put content of the file as a string
Pattern p = Pattern.compile(".*<head>.*<title>(.*)</title>.*</head>.*");
Matcher m = p.matcher(content);
while (m.find()) {
    title = m.group(1);
}
  • Have you considered not using a regex to parse HTML? Commented Mar 28, 2012 at 17:27
  • This sort of question is common, and the answer is the same: regex isn't suitable for parsing HTML. That being said, for something very tactical like this, you might be successful. Post your code and we'll look at it. Commented Mar 28, 2012 at 17:35
  • content= stringBuilder.toString(); // put content of the file as a string Pattern p = Pattern.compile(".*<head>.*<title>(.*)</title>.*</head>.*"); Matcher m = p.matcher(content); while (m.find()) { title = m.group(1); } Commented Mar 28, 2012 at 17:56
  • What is the structure the title comes in for XML? There's no need for an XML file to obey the head-title structure that HTML uses. Commented Mar 28, 2012 at 18:22
  • possible duplicate of Extracting Information from websites Commented Mar 28, 2012 at 20:42
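
As the comments point out, an XML file need not follow HTML's head/title layout, and a parser is generally a safer tool than a regex. Here is a minimal sketch for the XML case using the JDK's built-in DOM parser. It assumes the input is well-formed XML and that the title sits in an element named title; the class and method names (XmlTitle, firstTitle) are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XmlTitle {
    // Hypothetical helper: returns the text of the first <title> element
    // found anywhere in the document, or null if there is none.
    public static String firstTitle(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // Assumption: the element is literally named "title".
            NodeList titles = doc.getElementsByTagName("title");
            return titles.getLength() > 0
                    ? titles.item(0).getTextContent().trim()
                    : null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n<doc>\n  <title>My Feed</title>\n</doc>";
        System.out.println(firstTitle(xml));
    }
}
```

This will reject most real-world HTML, which is rarely well-formed XML, so it only covers the XML half of the question.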

2 Answers

As said above, regular expressions are not suited to XML and HTML parsing. However, in some cases they come in handy, so here is something that should work:

Pattern p = Pattern.compile("<head>.*?<title>(.*?)</title>.*?</head>", Pattern.DOTALL); 
Matcher m = p.matcher(content);
while (m.find()) {
    title = m.group(1);
}

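If you use Matcher.find(), there is no need to put .* before and after the pattern, since find() searches for a match anywhere in the input and those parts are not in any group. You should also look into reluctant quantifiers (i.e. *? instead of *, +? instead of +, etc.) so that the match stops at the first closing tag rather than the last one. Finally, you should use the Pattern.DOTALL flag; otherwise the dot does not match line terminators, and the pattern fails whenever the tags are on different lines.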
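
The pattern above still requires a <head> element, which an XML file may not have. Here is a sketch of a looser variant that only looks for the <title> element itself, assuming the XML files do contain an element named title somewhere. The class and method names (TitleExtractor, extractTitle) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {
    // DOTALL lets '.' match line terminators; CASE_INSENSITIVE also accepts <TITLE>.
    // The optional (?:\s[^>]*)? part tolerates attributes on the opening tag.
    private static final Pattern TITLE = Pattern.compile(
            "<title(?:\\s[^>]*)?>(.*?)</title>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the text of the first <title> element,
    // or null if the input contains none.
    public static String extractTitle(String content) {
        Matcher m = TITLE.matcher(content);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html>\n<head>\n<title>My Page</title>\n</head>\n</html>";
        String xml = "<?xml version=\"1.0\"?>\n<doc><title>My Feed</title></doc>";
        System.out.println(extractTitle(html));
        System.out.println(extractTitle(xml));
    }
}
```

Because it no longer anchors on <head>, this will also pick up a <title> nested elsewhere in the document (for example inside an embedded SVG), so it is only a tactical shortcut.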

OMG, regular expressions for this? What about the following (for example, to extract the body portion):

StringBuilder sb = new StringBuilder();
sb.append(html, html.indexOf("<body>") + 6, html.lastIndexOf("</body>"));
String headless = sb.toString();
System.out.println(headless);
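
The same indexOf idea could be applied to the title directly. This is a rough sketch, assuming the tags are written exactly as <title> and </title> (lowercase, no attributes); the class and method names (IndexOfTitle, titleOf) are made up for illustration.

```java
public class IndexOfTitle {
    // Hypothetical helper: returns the text between the first <title> and the
    // next </title>, or null if either tag is missing.
    public static String titleOf(String content) {
        String open = "<title>";
        String close = "</title>";
        int start = content.indexOf(open);
        if (start < 0) {
            return null; // no opening tag
        }
        start += open.length();
        int end = content.indexOf(close, start);
        if (end < 0) {
            return null; // no closing tag after the opening one
        }
        return content.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>My Page</title></head><body>Hi</body></html>";
        System.out.println(titleOf(html));
    }
}
```

Checking for -1 before slicing avoids the StringIndexOutOfBoundsException that a bare indexOf-based slice throws when a tag is absent.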
