Java regular expression to extract page title

Question

I am trying to extract page title from HTML and XML pages. This is the regular expression I use:

Pattern p = Pattern.compile(".*<head>.*<title>(.*)</title>.*</head>.*");

The problem is that it only extracts the title from HTML files and gives me null for XML files. Can anyone help me change the regex so that it gets the XML page titles as well?

Code:

content = stringBuilder.toString(); // the content of the file as a single string
Pattern p = Pattern.compile(".*<head>.*<title>(.*)</title>.*</head>.*");
Matcher m = p.matcher(content);
while (m.find()) {
    title = m.group(1);
}

This sort of question is common, and the answer is the same: regex isn't suitable for parsing HTML. That being said, for something very tactical like this, you might be successful. Post your code and we'll look at it.
– Tony Ennis, Commented Mar 28, 2012 at 17:35
content= stringBuilder.toString(); // put content of the file as a string Pattern p = Pattern.compile(".*<head>.*<title>(.*)</title>.*</head>.*"); Matcher m = p.matcher(content); while (m.find()) { title = m.group(1); }
– Lucy, Commented Mar 28, 2012 at 17:56
What is the structure the title comes in for XML? There's no need for an XML file to obey the head-title structure that HTML uses.
– GetSet, Commented Mar 28, 2012 at 18:22

Guillaume Polet · Accepted Answer · 2012-03-28 21:24:59Z

3

As said above, regular expressions are not well suited to XML and HTML parsing. However, in some cases they come in handy, so here is something that should work:

Pattern p = Pattern.compile("<head>.*?<title>(.*?)</title>.*?</head>", Pattern.DOTALL); 
Matcher m = p.matcher(content);
while (m.find()) {
    title = m.group(1);
}

If you use Matcher.find(), there is no need to put .* before and after the pattern: find() looks for a match anywhere in the input, and those parts are not in any group anyway. You may also want to look into reluctant quantifiers (i.e., *? instead of *, +? instead of +, etc.) if the greedy versions do not give the result you expect. Finally, you should also use the Pattern.DOTALL flag; otherwise the dot does not match line terminator characters.
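
To make the three points above concrete, here is a small self-contained sketch. The sample documents, the class name TitleRegexDemo, the helper firstTitle, and the RELAXED pattern are all assumptions made up for illustration; only ORIGINAL and ANSWER come from this thread. RELAXED drops the <head> requirement, on the assumption that the XML files simply have a <title> element somewhere, which may or may not be true for your data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleRegexDemo {
    // The question's pattern: greedy quantifiers, no flags.
    static final Pattern ORIGINAL =
            Pattern.compile(".*<head>.*<title>(.*)</title>.*</head>.*");

    // The pattern from this answer: reluctant quantifiers plus DOTALL.
    static final Pattern ANSWER =
            Pattern.compile("<head>.*?<title>(.*?)</title>.*?</head>", Pattern.DOTALL);

    // Hypothetical relaxation (not from the answer): no <head> required,
    // attributes on <title> tolerated, tag case ignored.
    static final Pattern RELAXED =
            Pattern.compile("<title(?:\\s[^>]*)?>(.*?)</title>",
                    Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the first captured title, trimmed, or null if nothing matches.
    static String firstTitle(Pattern p, String content) {
        Matcher m = p.matcher(content);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Made-up sample inputs; note that both contain line breaks.
        String html = "<html>\n<head>\n<title>My Page</title>\n</head>\n<body/>\n</html>";
        String xml = "<?xml version=\"1.0\"?>\n<doc>\n<title>My Feed</title>\n</doc>";

        // Without DOTALL the dot cannot cross the line breaks between the tags.
        System.out.println(firstTitle(ORIGINAL, html)); // null
        System.out.println(firstTitle(ANSWER, html));   // My Page
        // The sample XML has no <head>, so a pattern that requires one finds nothing.
        System.out.println(firstTitle(ANSWER, xml));    // null
        System.out.println(firstTitle(RELAXED, xml));   // My Feed
    }
}
```

If the XML files use a different element name for the title, the tag name in RELAXED would need to change accordingly; for anything beyond a quick extraction like this, a real XML or HTML parser is the safer choice.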

Mitja Gustin · Answer · 2014-04-03 13:01:39Z

1

OMG... regular expressions for this? What about the following (for example, to extract the body portion)?

StringBuilder sb = new StringBuilder();
sb.append(html, html.indexOf("<body>") + 6, html.lastIndexOf("</body>"));
String headless = sb.toString();
System.out.println(headless);
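
The same indexOf idea can be applied to the title itself. The following is only a sketch: the class name TitleIndexOfDemo and the method extractTitle are hypothetical, and it assumes the document contains a literal lowercase <title> tag with no attributes. Unlike the snippet above, it checks for a missing tag instead of letting indexOf's -1 result produce a wrong range or an exception.

```java
class TitleIndexOfDemo {
    // Returns the text between the first <title> and the next </title>,
    // trimmed, or null if either tag is missing.
    static String extractTitle(String content) {
        String open = "<title>";
        String close = "</title>";

        int start = content.indexOf(open);
        if (start < 0) {
            return null; // no opening tag
        }
        start += open.length(); // skip past the opening tag itself

        int end = content.indexOf(close, start);
        if (end < 0) {
            return null; // no closing tag after the opening one
        }
        return content.substring(start, end).trim();
    }

    public static void main(String[] args) {
        // Made-up sample inputs for illustration.
        String html = "<html><head>\n<title> My Page </title>\n</head><body></body></html>";
        String xml = "<?xml version=\"1.0\"?>\n<doc><title>My Feed</title></doc>";

        System.out.println(extractTitle(html));     // My Page
        System.out.println(extractTitle(xml));      // My Feed
        System.out.println(extractTitle("<doc/>")); // null
    }
}
```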
