Parsing HTML in web crawler

Question

Further to my earlier question ("Extending a basic web crawler to filter status codes and HTML"), I'm trying to extract information from HTML tags, in this case "title", with the following method:

public static void parsePage() throws IOException, BadLocationException 
{
    HTMLEditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    Reader HTMLReader = new InputStreamReader(testURL.openConnection()
            .getInputStream());
    kit.read(HTMLReader, doc, 0);

    // Create an iterator for all HTML tags.
    ElementIterator it = new ElementIterator(doc);
    Element elem;

    while ((elem = it.next()) != null) 
    {
        if (elem.getName().equals("title")) 
        {
            System.out.println("found title tag");
        }
    }
}

This works as far as telling me it has found the tag. What I'm struggling with is how to extract the text contained within it.

I found this question on the site: "Help with Java Swing HTML parsing". However, it states that the approach only works with well-formed HTML, and I was hoping there is another way.

Any pointers appreciated.

Alexey Grigorev · Accepted Answer · 2012-07-14 21:24:02Z

3

Try using Jodd

Jerry jerry = jerry().enableHtmlMode().parse(html);
...

Or HtmlParser

Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter = new CssSelectorNodeFilter("title");
NodeList nodes = parser.parse(cssFilter);

4 Comments

Robert Over a year ago

Thanks Alexey. Is there a way to do it without using an external library?

Alexey Grigorev Over a year ago

If you need a quick-and-dirty, throwaway solution, you can extract the title with a regular expression, but in general avoid using regexps for HTML.

Robert Over a year ago

Yes, I understand that using regexps for parsing HTML is frowned upon. In this instance I only need the "title" information.

Alexey Grigorev Over a year ago

In this case, given you don't like having an additional dependency, I would go with a simple regexp.
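
For reference, the "simple regexp" route discussed in these comments could look roughly like the sketch below. The class name TitleRegex and the method extractTitle are made up for illustration, and the pattern is only one possible choice. It does not decode entities such as &amp;amp; and it can be fooled by a title-like string inside a comment or script, so treat it as a throwaway helper rather than a parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library mentioned in this thread.
class TitleRegex {
    // CASE_INSENSITIVE handles <TITLE>; DOTALL lets the title span several lines.
    // The reluctant group (.*?) stops at the first closing tag.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text of the first <title> element, or empty if none is found.
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Collapse runs of whitespace (including newlines) into single spaces.
        return Optional.of(m.group(1).replaceAll("\\s+", " ").trim());
    }
}
```

Calling TitleRegex.extractTitle(pageSource) on the downloaded page text gives an Optional that is empty when the page has no title element.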

Robert · Accepted Answer · 2013-04-11 10:21:37Z

1

Turns out changing the method to this produces the desired result:

    public static void parsePage() throws IOException, BadLocationException
    {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        Reader HTMLReader = new InputStreamReader(testURL.openConnection().getInputStream());
        kit.read(HTMLReader, doc, 0);
        // The parser stores the <title> text as a document property.
        String title = (String) doc.getProperty(Document.TitleProperty);
        System.out.println(title);
    }

I think I was off on a wild goose chase with iterator/element stuff.
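
A self-contained variant of the same idea, in case it helps anyone test it without a live URL. This is only a sketch: the class name SwingTitleExtractor and both titleOf methods are my own naming, and the String overload exists purely so the parsing can be tried on an in-memory page. In the crawler you would pass an InputStreamReader over the connection's input stream to the Reader overload instead.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical wrapper around the approach in this answer.
class SwingTitleExtractor {
    // Parses the HTML from the reader and returns the title property,
    // or null if the parser never saw a <title> element.
    static String titleOf(Reader html) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Avoids a ChangedCharSetException when the page declares a charset in a meta tag.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try (Reader in = html) { // close the reader once parsing is done
            kit.read(in, doc, 0);
        }
        return (String) doc.getProperty(Document.TitleProperty);
    }

    // Convenience overload for in-memory HTML; wraps the checked exceptions.
    static String titleOf(String html) {
        try {
            return titleOf(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note that the Reader overload closes the reader it is given, which the original method above does not do.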

edited Apr 11, 2013 at 10:21

answered Jul 14, 2012 at 21:57
