Need help parsing html in python3, not well formed enough for xml.etree.ElementTree

Question

I keep getting mismatched tag errors all over the place. I'm not sure exactly why: the input is the craigslist homepage, which looks fine to me, though I haven't read through it thoroughly. Is there something more forgiving I could use, or is xml.etree.ElementTree my best bet for HTML parsing with the standard library?

The Python standard libraries are OK for basic stuff, but almost none of them are complete. Third-party libraries are usually more complete, more featureful, etc. So don't hold back; you will have to get used to installing and using third-party libraries eventually.
– Keith, Commented Feb 13, 2011 at 10:11

Ira Baxter · Accepted Answer · 2011-02-13 08:37:58Z

4

The mismatched tag errors are likely caused by mismatched tags. Browsers are famous for accepting sloppy HTML, which has made it easy for web page coders to write badly formed HTML, so there's a lot of it. There's no reason to believe that craigslist should be immune to bad web page designers.

You need to use a grammar that allows for these mismatches. If the parser you are using won't let you redefine the grammar appropriately, you are stuck. (There may be a better Python library for this, but I don't know it).

One alternative is to run the web page through a tool like Tidy that cleans up such mismatches, and then run your parser on that.

answered Feb 13, 2011 at 8:37

Ira Baxter


1 Comment

kryptobs2000 Over a year ago

Thanks, that does work for now, but I'm going to leave the question open to hopefully find a better answer. I'm not going to distribute this most likely, but I'm trying to make it an exercise to not include any additional dependencies. If that means cleaning it up myself I might do that even.
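For the no-extra-dependencies route, one option is the standard library's `html.parser.HTMLParser`. It reports tags as a stream of events instead of building a strict tree, so unclosed or crossed tags do not raise an error. The following is only a minimal sketch: the `LinkCollector` class and the sample markup are made up for illustration and are not the actual craigslist HTML.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Other end tags, matched or not, are simply ignored.
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Invented sample: the first <li> is never closed, and </b></a> are crossed.
sloppy = '<ul><li><a href="/sss">for sale</a><li><b><a href="/jjj">jobs</b></a></ul>'

collector = LinkCollector()
collector.feed(sloppy)
collector.close()
links = collector.links
print(links)
```

The trade-off is that you get events rather than a tree, so any notion of nesting has to be tracked by your own code.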

Matt Joiner · Accepted Answer · 2011-02-13 08:40:41Z

1

The best library for parsing unpredictable HTML is BeautifulSoup. Here's a quote from the project page:

You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser.

However, it isn't well supported on Python 3; there's more information about this at the end of the linked page.

answered Feb 13, 2011 at 8:40

Matt Joiner


1 Comment

kryptobs2000 Over a year ago

Any good links on cleaning it up myself? I'm trying to avoid any dependencies outside the standard library, just as a learning exercise. Maybe it's not worth it, though. It's going to be about a 300-line script, and I don't mind even doubling that, but if it's going to grow much beyond that, it's probably not worth it.
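As a rough idea of what "cleaning it up myself" could look like with only the standard library, the sketch below feeds `html.parser.HTMLParser` events into `xml.etree.ElementTree.TreeBuilder` and repairs mismatches with a simple open-tag stack. Everything here is an assumption for illustration: the `TolerantTreeBuilder` class, the synthetic `document` root, and the repair rules (close whatever is still open when an outer end tag arrives, drop stray end tags). This is not the HTML5 tree-construction algorithm, so a browser may nest some elements differently.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class TolerantTreeBuilder(HTMLParser):
    """Build an ElementTree from sloppy HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._builder = ET.TreeBuilder()
        self._open = []                       # names of currently open elements
        self._builder.start("document", {})   # synthetic root wrapping the page

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") arrive as None.
        attrib = {k: (v if v is not None else "") for k, v in attrs}
        self._builder.start(tag, attrib)
        if tag in VOID:
            self._builder.end(tag)            # void elements close immediately
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return                            # stray end tag: drop it
        # Close everything opened after the matching start tag, then the tag.
        while self._open:
            name = self._open.pop()
            self._builder.end(name)
            if name == tag:
                break

    def handle_data(self, data):
        self._builder.data(data)

    def finish(self):
        self.close()
        while self._open:                     # close anything left open at EOF
            self._builder.end(self._open.pop())
        self._builder.end("document")
        return self._builder.close()


# Invented sample with an unclosed <li> and crossed </b></a>.
sloppy = '<ul><li><a href="/sss">for sale</a><li><b><a href="/jjj">jobs</b></a></ul>'

parser = TolerantTreeBuilder()
parser.feed(sloppy)
root = parser.finish()
hrefs = [a.get("href") for a in root.iter("a")]
print(hrefs)
```

The result is a regular `Element`, so the usual `find`, `findall`, and `iter` calls work on it. Note that in this sketch the second `<li>` ends up nested inside the first one, because implied end tags are not modelled.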

Lennart Regebro · Accepted Answer · 2011-02-14 11:04:57Z

0

Parsing HTML is not an easy problem, so using a library is definitely the solution here. The two common libraries for parsing HTML that isn't well formed are BeautifulSoup and lxml.

lxml supports Python 3, and its HTML parser handles unpredictable HTML well. It's awesome, and fast as well, since it uses C libraries underneath. I highly recommend it.

BeautifulSoup 3.1 supports Python 3, but it is also deemed "a failed experiment" and you are told not to use it, so in practice BeautifulSoup doesn't support Python 3 yet, leaving lxml as the only alternative.
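For reference, a minimal sketch of the lxml route, assuming lxml has been installed separately (it is a third-party package, not part of the standard library). The sample markup is invented for illustration, and the import guard is only there so the snippet does not crash where lxml is missing.

```python
try:
    from lxml import html as lxml_html   # third-party: pip install lxml
except ImportError:
    lxml_html = None                     # lxml not available; example is skipped

# Invented sample with an unclosed <li> and crossed </b></a>.
sloppy = '<ul><li><a href="/sss">for sale</a><li><b><a href="/jjj">jobs</b></a></ul>'

hrefs = None
if lxml_html is not None:
    # fromstring() uses lxml's forgiving HTML parser, not its strict XML one.
    doc = lxml_html.fromstring(sloppy)
    hrefs = [a.get("href") for a in doc.iter("a")]
    print(hrefs)
```

The returned elements follow the ElementTree API, so code written against `xml.etree.ElementTree` usually needs few changes.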

edited Feb 14, 2011 at 11:04

answered Feb 13, 2011 at 20:35

Lennart Regebro

