Further to my earlier question, "Extending a basic web crawler to filter status codes and HTML", I'm trying to extract information from HTML tags, in this case "title", with the following method:

public static void parsePage() throws IOException, BadLocationException 
{
    HTMLEditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    Reader HTMLReader = new InputStreamReader(testURL.openConnection()
            .getInputStream());
    kit.read(HTMLReader, doc, 0);

    // Create an iterator for all HTML tags.
    ElementIterator it = new ElementIterator(doc);
    Element elem;

    while ((elem = it.next()) != null) 
    {
        if (elem.getName().equals("title")) 
        {
            System.out.println("found title tag");
        }
    }
}

This works as far as telling me that it has found the tag. What I'm struggling with is how to extract the text contained within it.

I found this question on the site: "Help with Java Swing HTML parsing". However, it states that the approach only works with well-formed HTML, and I was hoping there is another way.

Any pointers appreciated.

2 Answers

Try using Jodd:

Jerry jerry = jerry().enableHtmlMode().parse(html);
...

Or HtmlParser:

Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter = new CssSelectorNodeFilter("title");
NodeList nodes = parser.parse(cssFilter);

4 Comments

Thanks Alexey. Is there a way to do it without using an external library?
If you need a quick-and-dirty, throw-away solution, you can extract the title with a regular expression, but in general avoid using regexps to parse HTML.
Yes, I understand that using regexps for parsing HTML is frowned upon. In this instance I only need the "title" information.
In that case, given that you don't want an additional dependency, I would go with a simple regexp.
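For reference, a minimal sketch of that quick-and-dirty regexp route, using only java.util.regex. The class and method names (TitleRegexSketch, extractTitle) are made up for illustration, and it assumes the page has already been read into a String. It only looks for the first title element and will be fooled by unusual markup, such as a "<title>" inside a comment or script, so treat it as throw-away code rather than a parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class TitleRegexSketch {

    // Case-insensitive, and DOTALL so a title spanning several lines still matches.
    // The reluctant (.*?) stops at the first closing </title>.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed text of the first title element, or empty if none is found.
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE> Example page </TITLE></head><body></body></html>";
        System.out.println(extractTitle(html).orElse("(no title)"));
    }
}
```

Returning an Optional makes the "no title on this page" case explicit, which a crawler will hit fairly often.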

Turns out changing the method to this produces the desired result:

public static void parsePage() throws IOException, BadLocationException
{
    HTMLEditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    Reader HTMLReader = new InputStreamReader(testURL.openConnection().getInputStream());
    kit.read(HTMLReader, doc, 0);

    // The parser stores the <title> text as a document property while reading.
    String title = (String) doc.getProperty(Document.TitleProperty);
    System.out.println(title);
}

I think I was off on a wild goose chase with the iterator/element approach.
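For anyone who wants to try this without a live URL, here is a self-contained sketch of the same idea that parses from any Reader. The class and method names (SwingTitleSketch, titleOf) and the sample HTML are invented for the example. It assumes the Swing parser records the title text under Document.TitleProperty while reading, which is what the method above relies on.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical wrapper around the approach above.
final class SwingTitleSketch {

    // Parses the HTML from the reader and returns the title property,
    // or null if the parser did not record one.
    static String titleOf(Reader source) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Avoids a ChangedCharSetException when the page declares its own charset.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(source, doc, 0);
        return (String) doc.getProperty(Document.TitleProperty);
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>Example page</title></head>"
                + "<body><p>Hello</p></body></html>";
        try (Reader reader = new StringReader(html)) {
            System.out.println(titleOf(reader));
        }
    }
}
```

To use it against a real page, pass an InputStreamReader over the connection's input stream, as in the method above, and close it when done.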
