Python parsing html mismatched tag error

Question

30   <li class="start_1">
31               <input type="checkbox" name="word_ids[]" value="34" class="list_check">
32          </li>

This is part of an HTML file that I want to parse. But when I ran

import xml.etree.ElementTree as ET

uh = open('1.htm', 'r')
data = uh.read()
print(data)
tree = ET.fromstring(data)  # this is the call that raises the error

it raised

xml.etree.ElementTree.ParseError: mismatched tag: line 32, column 18

What is going wrong?

Martijn Pieters · Accepted Answer · 2016-09-25 13:59:07Z

1

You are trying to parse HTML with an XML parser; the latter doesn't have a concept of <input> not having a closing tag.
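To make the difference concrete, here is a minimal sketch using only the standard library. It feeds the fragment from the question to the XML parser, then feeds it a variant in which `<input>` is self-closed. The helper name `try_parse` and the two variable names are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The fragment from the question: <input> is never closed, which is valid HTML.
html_fragment = (
    '<li class="start_1">'
    '<input type="checkbox" name="word_ids[]" value="34" class="list_check">'
    '</li>'
)

# The same fragment written as well-formed XML: <input ... /> closes itself.
xml_fragment = (
    '<li class="start_1">'
    '<input type="checkbox" name="word_ids[]" value="34" class="list_check" />'
    '</li>'
)


def try_parse(markup):
    """Return the parsed root element, or the ParseError if parsing fails."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError as exc:
        return exc


html_result = try_parse(html_fragment)  # a ParseError ("mismatched tag")
xml_result = try_parse(xml_fragment)    # an Element for <li>

print(type(html_result).__name__, html_result)
print(xml_result.tag, xml_result[0].attrib["value"])
```

The XML parser sees `</li>` while `<input>` is still open, so it reports a mismatched tag at the closing `</li>`. That is why the error in the question points at line 32 rather than at the `<input>` line.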

Use an actual HTML parser; if you want to access the result with an ElementTree-compatible API, use the lxml project, which includes an HTML parser. Otherwise, use BeautifulSoup (which can use lxml under the hood as the parsing engine).
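As a rough sketch of the lxml route, assuming lxml is installed (`pip install lxml`), the fragment from the question parses without complaint, and the result can be navigated with the familiar ElementTree-style methods. The variable names here are illustrative only.

```python
import lxml.html

# The fragment from the question, with the unclosed <input> left as-is.
fragment = '''<li class="start_1">
    <input type="checkbox" name="word_ids[]" value="34" class="list_check">
</li>'''

# lxml's HTML parser knows that <input> is a void element.
root = lxml.html.fromstring(fragment)

# ElementTree-style access: tag, get(), find().
checkbox = root.find("input")
print(root.tag, root.get("class"))
print(checkbox.get("name"), checkbox.get("value"))
```

For a whole file rather than a string, `lxml.html.parse('1.htm')` returns a tree whose `getroot()` gives the root element.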

Mirko Conti · Accepted Answer · 2016-09-25 14:28:44Z

1

To parse HTML in Python, I use lxml:

import lxml.html
# HTML string
dom = '<li class="start_1">...</li>'
# get the root node
root_node = lxml.html.fromstring(dom)

After that you can query the tree, for example using XPath:

nodes = root_node.xpath("//*")
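A more targeted query than `//*` might look like the sketch below, which pulls the `value` attribute of every checkbox. It assumes lxml is installed; the surrounding `<ul>` and the second list item (value 35) are invented for illustration and are not from the question's file.

```python
import lxml.html

# Hypothetical list markup modelled on the question's fragment.
dom = '''<ul>
    <li class="start_1">
        <input type="checkbox" name="word_ids[]" value="34" class="list_check">
    </li>
    <li class="start_1">
        <input type="checkbox" name="word_ids[]" value="35" class="list_check">
    </li>
</ul>'''

root_node = lxml.html.fromstring(dom)

# Relative XPath: every checkbox <input> under the root, returning its value attribute.
values = root_node.xpath('.//input[@type="checkbox"]/@value')

# The same query returning the elements themselves instead of attribute strings.
inputs = root_node.xpath('.//input[@class="list_check"]')

print(values)
print(len(inputs))
```

Selecting `/@value` returns plain strings, while selecting the element returns objects you can inspect further with `.get()` or `.attrib`.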
