30   <li class="start_1">
31               <input type="checkbox" name="word_ids[]" value="34" class="list_check">
32          </li> 

This is part of an HTML file that I want to parse. But when I ran

import xml.etree.ElementTree as ET

with open('1.htm', 'r') as uh:
    data = uh.read()
print(data)
tree = ET.fromstring(data)

it raised this error:

xml.etree.ElementTree.ParseError: mismatched tag: line 32, column 18

What is going wrong?

2 Answers

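You are trying to parse HTML with an XML parser. XML requires every element to be closed, but in HTML <input> is a void element with no closing tag, so the XML parser reaches </li> while it still expects </input> and reports a mismatched tag.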

Use an actual HTML parser; if you want to access the result with an ElementTree-compatible API, use the lxml project, which includes an HTML parser. Otherwise, use BeautifulSoup (which can use lxml under the hood as the parsing engine).
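
To see the mismatch described above concretely, here is a minimal sketch that uses only the Python 3 standard library. It swaps in the built-in html.parser module purely to avoid a third-party dependency; it is not the lxml or BeautifulSoup approach recommended here. The SNIPPET string is retyped from the question, and InputCollector is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The fragment from the question, without the editor line numbers.
SNIPPET = '''<li class="start_1">
    <input type="checkbox" name="word_ids[]" value="34" class="list_check">
</li>'''

# The XML parser expects </input> before </li>, so it rejects the snippet.
try:
    ET.fromstring(SNIPPET)
    xml_error = None
except ET.ParseError as exc:
    xml_error = str(exc)


class InputCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every <input> tag."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "input":
            self.inputs.append(dict(attrs))


# An HTML parser accepts the unclosed <input> without complaint.
collector = InputCollector()
collector.feed(SNIPPET)
collector.close()

print(xml_error)
print(collector.inputs)
```

The XML attempt should fail with a "mismatched tag" error, while the HTML parser reads the checkbox attributes normally.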

To parse HTML in Python, I use lxml:

import lxml.html

# HTML string
dom = '<li class="start_1">...</li>'
# get the root node
root_node = lxml.html.fromstring(dom)

After that you can work with the tree, for example using XPath:

nodes = root_node.xpath("//*")
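
As a more targeted illustration, the sketch below pulls the checkbox value out of the fragment from the question. It assumes lxml is installed; the html string is retyped from the question, and the XPath expression is just one possible way to select the attribute.

```python
import lxml.html

# The <li> fragment from the question, on one line.
html = ('<li class="start_1">'
        '<input type="checkbox" name="word_ids[]" value="34" class="list_check">'
        '</li>')

root_node = lxml.html.fromstring(html)

# Select the value attribute of every <input> with class "list_check".
values = root_node.xpath('//input[@class="list_check"]/@value')
print(values)
```

For this fragment the result should contain the single value 34. Selecting on the name attribute instead would work the same way.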
