Python SAX parser says XML file is not well-formed

Question

I stripped some tags that I thought were unnecessary from an XML file. Now when I try to parse it, my SAX parser throws an error and says my file is not well-formed. However, I know every start tag has an end tag. The file's opening tag has a link to an XML schema. Could this be causing the trouble? If so, then how do I fix it?

Edit: I think I've found the problem. My character data contains "&lt" and "&gt" characters, presumably from html tags. After being parsed, these are converted to "<" and ">" characters, which seems to bother the SAX parser. Is there any way to prevent this from happening?

The opening tag's link to an XML schema might be a namespace declaration. You'll want to leave that in.
– mechanical_meat, Commented Apr 2, 2009 at 6:43
Never give a summary of the error message ("says my file is not well-formed"). Always give the literal message.
– bortzmeyer, Commented Apr 3, 2009 at 9:32

paxdiablo · Accepted Answer · 2009-04-02 06:42:50Z

2

I would suggest putting those tags back in and making sure it still works. Then, if you want to take them out, do it one at a time until it breaks.

However, I question the wisdom of taking them out. If it's your XML file, you should understand it better. If it's a third-party XML file, you really shouldn't be fiddling with it (until you understand it better :-).

answered Apr 2, 2009 at 6:42

paxdiablo


Jon Skeet · Accepted Answer · 2009-04-03 21:24:08Z

1

Does the SAX parser not give you details about where it thinks the file is not well-formed?
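As a minimal sketch of how to get that location from Python's standard `xml.sax` module: the parser raises `SAXParseException`, which carries a line number, a column number, and the literal message. The helper name `find_wellformedness_error` and the sample document are made up for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def find_wellformedness_error(data):
    """Return None if data parses, else a (line, column, message) tuple."""
    try:
        # A bare ContentHandler ignores every event; we only care about errors.
        xml.sax.parseString(data, ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc.getLineNumber(), exc.getColumnNumber(), exc.getMessage()
    return None


# Hypothetical sample: "&lt" is missing its terminating semicolon on line 2.
broken = b"<root>\n<item>a &lt b</item>\n</root>"
print(find_wellformedness_error(broken))
```

For a file on disk, `xml.sax.parse(path, handler)` raises the same exception, so the same `except` clause applies.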

Have you tried loading the file into an XML editor and checking it there? Do other XML parsers accept it?

The schema shouldn't change whether the XML is well-formed; it may well change whether it's valid. See the Wikipedia entry for XML well-formedness for a little bit more, or the XML specs for a lot more detail :)

EDIT: To represent "&" in text, you should escape it as &amp;

So:

&lt

should be

&amp;lt

(assuming you really want ampersand, l, t).
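In Python, this escaping does not have to be done by hand. The sketch below uses `escape` and `unescape` from the standard `xml.sax.saxutils` module; the sample strings are only illustrations, not data from the question.

```python
from xml.sax.saxutils import escape, unescape

# A literal ampersand followed by "lt", as in the example above.
literal = "&lt"
escaped = escape(literal)
print(escaped)  # &amp;lt

# Markup characters in character data are escaped the same way.
html_fragment = "<b>bold</b>"
print(escape(html_fragment))  # &lt;b&gt;bold&lt;/b&gt;

# unescape reverses the transformation.
print(unescape(escaped))  # &lt
```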

edited Apr 3, 2009 at 21:24

answered Apr 2, 2009 at 6:38

Jon Skeet


4 Comments

Jacob Lyles Over a year ago

I examined the file in the offending place, and it is only character data (unless I'm counting the lines wrong). Unfortunately, the file is too large to be worked with in a standard editor. I have a root tag, and open and close tags. This remains a mystery.

Jon Skeet Over a year ago

Try it with another non-DOM parser (XmlReader in .NET, or maybe SAX in Java) and see whether it works there or possibly gives more useful information.

bortzmeyer Over a year ago

"Too large"? Stop using vague words. How many bytes is it? It may be time to switch to a serious editor...

Jacob Lyles Over a year ago

I figured out the problem. I am reading in the data with the SAX parser and writing it back out. In the process, character data containing "&lt" and "&gt" entities is converted into literal "<" and ">" characters. Presumably they are HTML tags. Do you know how to stop this from happening?
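One possible approach to the round trip described in the last comment, sketched under the assumption that the output is written by a SAX handler: the parser always hands decoded text to `characters()`, so the escaping has to be reapplied on the way out. The standard `xml.sax.saxutils.XMLGenerator` handler does this. The `roundtrip` helper and the sample document are hypothetical.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


def roundtrip(data):
    """Parse XML bytes and serialize them again, re-escaping character data."""
    out = io.StringIO()
    # XMLGenerator is a ContentHandler that writes the events it receives
    # back out as XML, escaping &, < and > in character data.
    xml.sax.parseString(data, XMLGenerator(out, encoding="utf-8"))
    return out.getvalue()


# Hypothetical sample: escaped HTML markup inside character data.
source = b"<doc>&lt;b&gt;bold&lt;/b&gt;</doc>"
result = roundtrip(source)
print(result)
```

If the output is written by a custom handler instead, calling `xml.sax.saxutils.escape` on the text inside `characters()` before writing it should have the same effect.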

StaxMan · Accepted Answer · 2009-04-02 18:32:53Z

0

I would second the recommendation to try parsing it with another XML parser. That should give an indication as to whether it's the document that's wrong, or the parser.

Also, the actual error message might be useful. One fairly common problem, for example, is that the XML declaration (if one is used; it's optional) must be the very first thing in the file -- not even whitespace is allowed before it.
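The declaration rule is easy to check with Python's `xml.sax`. This is a small sketch; the helper name `is_well_formed` and both sample documents are invented for the demonstration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def is_well_formed(data):
    """Return True if the SAX parser accepts data, False otherwise."""
    try:
        xml.sax.parseString(data, ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


# Declaration is the very first thing in the document.
good = b'<?xml version="1.0"?><root/>'
# Same document with a single newline before the declaration.
bad = b'\n<?xml version="1.0"?><root/>'

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```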

answered Apr 2, 2009 at 18:32

StaxMan


stesch · Accepted Answer · 2009-04-03 21:31:39Z

0

You could load it into Firefox, if you don't have an XML editor. Firefox shows you the error.

answered Apr 3, 2009 at 21:31

stesch
