0

I stripped some tags that I thought were unnecessary from an XML file. Now when I try to parse it, my SAX parser throws an error and says my file is not well-formed. However, I know every start tag has an end tag. The file's opening tag has a link to an XML schema. Could this be causing the trouble? If so, then how do I fix it?

Edit: I think I've found the problem. My character data contains "&lt" and "&gt" characters, presumably from html tags. After being parsed, these are converted to "<" and ">" characters, which seems to bother the SAX parser. Is there any way to prevent this from happening?

4
  • validator.w3.org Commented Apr 2, 2009 at 6:41
  • The opening tag link to an XML schema might be a namespace. You'll want to leave that in. Commented Apr 2, 2009 at 6:43
  • It might help if you provided the actual error from SAX. Commented Apr 2, 2009 at 16:05
  • Never give a summary of the error message ("says my file is not well-formed"). Always give the literal message. Commented Apr 3, 2009 at 9:32

4 Answers

2

I would suggest putting those tags back in and making sure it still works. Then, if you want to take them out, do it one at a time until it breaks.

However, I question the wisdom of taking them out. If it's your XML file, you should understand it better. If it's a third-party XML file, you really shouldn't be fiddling with it (until you understand it better :-).


1

Does the SAX parser not give you details about where it thinks the file is not well-formed?

Have you tried loading the file into an XML editor and checking it there? Do other XML parsers accept it?

The schema shouldn't change whether the XML is well-formed; it may well change whether it's valid. See the Wikipedia entry for XML well-formedness for a little bit more, or the XML specs for a lot more detail :)
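To make the well-formed/valid distinction concrete, here is a minimal sketch using Python's standard `xml.sax` module (the question doesn't say which SAX implementation is in use, so Python is an assumption, and `check_well_formed` is a hypothetical helper name). It only checks well-formedness and reports where the parser gave up:

```python
import xml.sax


def check_well_formed(xml_bytes):
    """Return None if xml_bytes is well-formed, else (line, column, message)."""
    try:
        # A bare ContentHandler ignores every event; we only care whether
        # the parser gets to the end without raising.
        xml.sax.parseString(xml_bytes, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc.getLineNumber(), exc.getColumnNumber(), exc.getMessage()
    return None


# A schema reference the parser cannot resolve does not affect well-formedness,
# because this parser does not validate.
with_schema = (
    b'<a xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    b'xsi:noNamespaceSchemaLocation="missing.xsd"><b>text</b></a>'
)
print(check_well_formed(with_schema))

# A mismatched tag, a bare "<" and an entity without its ";" are all errors.
print(check_well_formed(b"<a>\n<b>text</a>"))
print(check_well_formed(b"<a>1 < 2</a>"))
print(check_well_formed(b"<a>x &lt y</a>"))
```

The line and column in the returned tuple are usually the quickest way to find the offending spot in a file that is too large to open in an editor.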

EDIT: To represent "&" in text, you should escape it as &amp;

So:

&lt

should be

&amp;lt

(assuming you really want ampersand, l, t).
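If the text is being read with a SAX parser and written back out, the parser has already decoded the entities by the time the character data arrives, so the text has to be escaped again on output. A minimal sketch in Python, assuming the standard `xml.sax` module (`TextCollector` and `parsed_text` are hypothetical names for illustration):

```python
import xml.sax
from xml.sax.saxutils import escape


class TextCollector(xml.sax.ContentHandler):
    """Collects character data as the parser delivers it (entities decoded)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parsed_text(xml_bytes):
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


source = b"<doc>&lt;b&gt;bold&lt;/b&gt; &amp; more</doc>"
text = parsed_text(source)   # decoded text containing raw "<", ">" and "&"
safe = escape(text)          # re-escape before embedding it in new XML
rewritten = "<doc>" + safe + "</doc>"

# The rewritten document parses again and yields the same character data.
assert parsed_text(rewritten.encode("utf-8")) == text
```

Writing `text` out without the `escape` call would produce real `<b>` tags and a bare `&` in the output. Python's `xml.sax.saxutils.XMLGenerator` applies the same escaping to character data automatically if you route the events through it instead of concatenating strings.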

4 Comments

I examined the file in the offending place, and it is only character data (unless I'm counting the lines wrong). Unfortunately, the file is too large to be worked with in a standard editor. I have a root tag, and open and close tags. This remains a mystery.
Try it with another non-DOM parser (XmlReader in .NET, or maybe SAX in Java) and see whether it works there or possibly gives more useful information.
"Too large"? Stop using vague words. How many bytes is it? It may be time to switch to a serious editor...
I figured out the problem. I am reading in the data with the SAX parser and writing it back out. In the process, some character data with "&lt" "&gt" symbols is getting converted into "< >". Presumably they are HTML tags. Do you know how to stop this from happening?
0

I would second the recommendation to try parsing it with another XML parser. That should give an indication as to whether it's the document that's wrong, or the parser.

Also, the actual error message might be useful. One fairly common problem, for example, is that the XML declaration (if one is used, it's optional) must be the very first thing -- not even whitespace is allowed before it.
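The declaration rule is easy to check in isolation. A small sketch, again assuming Python's standard `xml.sax` module (`parse_error` is a hypothetical helper name):

```python
import xml.sax


def parse_error(xml_bytes):
    """Return the parser's error message, or None if parsing succeeds."""
    try:
        xml.sax.parseString(xml_bytes, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc.getMessage()
    return None


good = b'<?xml version="1.0"?><root/>'
bad = b' <?xml version="1.0"?><root/>'   # a single leading space

print(parse_error(good))          # None: the declaration is the first thing
print(parse_error(bad))           # an error message from the parser
print(parse_error(bad.lstrip()))  # None again once the whitespace is removed
```

If tags were stripped from the top of the file by hand, a stray blank line or space left before the declaration would trigger exactly this kind of failure.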


0

You could load it into Firefox, if you don't have an XML editor. Firefox shows you the error.
