Is there a validating HTML parser implemented in Java?

Question

I need to parse HTML 4 in Java. Ideally I'd like an implementation that is SAX compatible.

I'm aware that there are numerous HTML parsers for Java; however, they all seem to perform 'tidying'. In other words, they correct badly formed HTML. I don't want this.

My requirements are:

No tidying.
If the input document is invalid HTML parsing should fail.
The document should be validatable against the HTML DTDs.
The parser can produce SAX2 events.

Is there a library that meets these requirements?
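To make the desired contract concrete, here is a minimal sketch of "fail on invalid input, validate against a DTD, emit SAX2 events" using the JDK's own SAX parser. It only handles XML with an inline DTD, so it cannot read HTML 4 itself; the class name `FailFastSaxCheck` and the toy DTD in the usage are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: illustrates the fail-fast contract on XML input only.
class FailFastSaxCheck {

    // Returns true only if the document is well-formed AND valid against its DTD.
    static boolean isValid(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler()); // SAX2 events would be consumed here
            reader.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) {
                    // warnings are ignored in this sketch
                }

                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // validity errors abort the parse instead of being recovered
                }

                @Override
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e; // well-formedness errors abort the parse
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE html [<!ELEMENT html (body)>"
                + "<!ELEMENT body (p*)><!ELEMENT p (#PCDATA)>]>";
        System.out.println(isValid(dtd + "<html><body><p>ok</p></body></html>"));
        System.out.println(isValid(dtd + "<html><body><div>no</div></body></html>"));
    }
}
```

The key design point is the `ErrorHandler`: by default a SAX parser reports validity errors and carries on, so rethrowing from `error()` is what turns validation into a hard failure.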

If the parser doesn't tidy, it can't create a DOM tree; a valid HTML document may not be a valid XML document (e.g., think of all those <p> tags that have no corresponding closing tags).
– jdigital, Commented May 24, 2009 at 20:33
It could fire SAX events as if it were a <p/> XML element, right?
– johnstok, Commented May 25, 2009 at 8:16

adrian.tarau · Accepted Answer · 2009-05-24 18:16:54Z

2

You can find a collection of HTML parsers here: HTML Parsers. I don't remember exactly, but I think TagSoup parses the file without applying corrections...


4 Comments

johnstok Over a year ago

"TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild..." Unfortunately not.

adrian.tarau Over a year ago

"It does guarantee well-structured results: tags will wind up properly nested, default attributes will appear appropriately, and so on."

adrian.tarau Over a year ago

If it is able to populate default attributes, this means it parses the DTD... It's not clear whether it fails if the document fails validation.

adrian.tarau Over a year ago

Also have a look at javax.swing.text.html.parser.Parser; it looks like it does DTD validation:

    protected void endTag(boolean omitted) {
        handleText(stack.tag);
        if (omitted && !stack.elem.omitEnd()) {
            error("end.missing", stack.elem.getName());
        } else if (!stack.terminate()) {
            error("end.unexpected", stack.elem.getName());
        }

Paul · Accepted Answer · 2016-10-13 13:22:30Z

I think the Jericho HTML Parser can deliver at least one of your core requirements ('If the input document is invalid HTML parsing should fail.') in that it will at least tell you if there are mismatched tags or other poisonous HTML flaws, and you can choose to fail based on this information.

Try typing invalid HTML into this Jericho formatting demo, and note the 'Parser Log' at the bottom of the page:

http://jerichohtmlparser.appspot.com/samples/FormatSource.jsp

So yes, this is doing tag tidying, but it is at least telling you about it - you can grab this information by setting a net.htmlparser.jericho.Logger (e.g. a WriterLogger or something more specific of your own creation) on your source, and then proceeding depending on what errors are logged out. This is a small example:

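    // myListeningLogger is your own net.htmlparser.jericho.Logger implementation
    Source source = new Source("<a>I forgot to close my link!");
    source.setLogger(myListeningLogger);

    // Formatting forces a full parse; the formatted output itself is discarded.
    // NullWriter is not in the JDK (e.g. Apache Commons IO has one; Writer.nullWriter() works on Java 11+).
    source.getSourceFormatter().writeTo(new NullWriter());
    // myListeningLogger has now had all the HTML flaws written to it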

In the example above, your logger's info() method is called with the string 'StartTag at (r1,c1,p0) missing required end tag', which is relatively parseable. You can always decide to reject any HTML that logs a message worse than debug. In fact, Jericho logs almost all errors at 'info' level, with a couple at 'warn' level (you might be tempted to create a small fork with the severities adjusted to correspond to what you care about).
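As a rough illustration of "relatively parseable", the sketch below pulls the row, column and position out of a message shaped like the one quoted above. The pattern is inferred from that single example, so other Jericho messages may well not match it; the class name `JerichoLogLine` is made up, and the code uses only the JDK, not Jericho itself.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class for one logged flaw.
class JerichoLogLine {
    // Assumed shape: "<subject> at (r<row>,c<column>,p<position>) <description>"
    private static final Pattern SHAPE =
            Pattern.compile("^(.+?) at \\(r(\\d+),c(\\d+),p(\\d+)\\) (.+)$");

    final String subject;
    final int row;
    final int column;
    final int position;
    final String description;

    private JerichoLogLine(String subject, int row, int column, int position, String description) {
        this.subject = subject;
        this.row = row;
        this.column = column;
        this.position = position;
        this.description = description;
    }

    // Returns an empty Optional when the message does not follow the assumed shape.
    static Optional<JerichoLogLine> parse(String message) {
        Matcher m = SHAPE.matcher(message);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new JerichoLogLine(
                m.group(1),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                Integer.parseInt(m.group(4)),
                m.group(5)));
    }

    public static void main(String[] args) {
        parse("StartTag at (r1,c1,p0) missing required end tag").ifPresent(line ->
                System.out.println(line.subject + " @ row " + line.row + ": " + line.description));
    }
}
```

A listening logger could call `JerichoLogLine.parse` from its info() and warn() methods and treat any non-empty result as grounds for rejecting the document.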

Jericho is available on Maven Central, which is always a good sign:

http://mvnrepository.com/artifact/net.htmlparser.jericho/jericho-html

Good luck!

monceaux · Accepted Answer · 2009-05-25 08:34:36Z

1

You may wish to check http://lobobrowser.org/cobra.jsp. They have a pure Java web browser (Lobo) implemented. They have the parser component (Cobra) pulled out separately for use. I honestly am not sure if it will do what you require with the "no tidying" requirement, but it may be worth a look. I ran across it when exploring the wild for a pure Java web browser.


David Rabinowitz · Accepted Answer · 2009-05-25 10:12:10Z

0

You can try to subclass javax.swing.text.html.parser.Parser and implement the handleXXX() methods. It doesn't seem to try to fix the HTML. See the API documentation for more.
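Here is a minimal sketch of how the Swing parser's error reports could be collected and turned into a failure. It takes the public ParserDelegator and HTMLEditorKit.ParserCallback route rather than subclassing Parser directly, which is a swapped-in approach, and the class name `SwingHtmlErrorCheck` is made up. Note also that, as far as I know, the Swing parser works from its bundled HTML 3.2 DTD and still recovers from the problems it reports, so this is not an HTML 4 validator.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: gathers whatever the Swing HTML parser reports as errors.
class SwingHtmlErrorCheck {

    static List<String> collectErrors(String html) {
        final List<String> errors = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleError(String errorMsg, int pos) {
                // The parser keeps going after reporting; we only record the message.
                errors.add(pos + ": " + errorMsg);
            }
        };
        try {
            // The third argument (ignoreCharSet) is true so charset declarations are not acted on.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return errors;
    }

    // Turns any reported error into a hard failure, as the question asks for.
    static void requireNoErrors(String html) {
        List<String> errors = collectErrors(html);
        if (!errors.isEmpty()) {
            throw new IllegalArgumentException("Invalid HTML: " + errors);
        }
    }

    public static void main(String[] args) {
        for (String error : collectErrors("<a>I forgot to close my link!")) {
            System.out.println(error);
        }
    }
}
```

Whether the set of errors reported is strict enough for real validation would need checking against your own documents.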
