I have a Java class that parses an XML file and writes its content to MySQL. Everything works fine until the XML file contains an invalid Unicode character; then an exception is thrown and the program stops parsing the file.

My provider sends this XML file daily with a list of products and their price, quantity, etc. I have no control over it, so invalid characters will always be there.

All I'm trying to do is catch these errors, ignore them, and continue parsing the rest of the XML file.

I've added try-catch statements to the startElement, endElement and characters methods of the SAXHandler class. However, they don't catch any exception, and execution stops whenever the parser finds an invalid character.

It seems that I can only catch these exceptions in the method that calls the parser:

    try {
        myIS = new FileInputStream(xmlFilePath);
        parser.parse(myIS, handler);
        retValue = true;
    } catch(SAXParseException err) {
        System.out.println("SAXParseException " + err);
    }

However, that's useless in my case. Even though the exception tells me where the invalid character is, execution stops, so the list of products ends up far from complete. The list has about 8,000 products and only a couple of invalid characters, but if an invalid character falls within the first 100 products, the remaining 7,900 products are not updated in the database. I've also noticed that the endDocument method is not called if an exception occurs.
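To illustrate the behaviour described above, here is a minimal sketch (the class name `FatalErrorDemo` and the inline test document are hypothetical, not part of my real code). The exception is raised by the parser itself while it scans the input, not by the handler callbacks, so a try-catch inside `characters` has nothing to catch, and the error only surfaces at the `parse` call:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
public class FatalErrorDemo {

    /** Returns the exception raised while parsing, or null if the document parsed cleanly. */
    static SAXParseException parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // A try-catch placed here would only see exceptions thrown
                // by this method's own code, not errors found by the parser.
            }
        };
        try {
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return null;
        } catch (SAXParseException e) {
            // The only place the well-formedness error can be observed.
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        // (char) 1 is the same illegal character reported in my real file.
        SAXParseException e = parse("<a>x" + (char) 1 + "y</a>");
        System.out.println(e);
    }
}
```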

Somebody asked the same question here some years ago, but didn't get any solution.

I'd really appreciate any ideas or workarounds for this.

Data Sample (as requested):

    <Producto>
     <Brand>
      <Description>Epson</Description>
      <ManufacturerId>eps</ManufacturerId>
      <BrandId>eps</BrandId>
     </Brand>
     <New>false</New>
     <OnSale>null</OnSale>
     <Type>Physical</Type>
     <Description>Epson TM T88V - Impresora de recibos - línea térmica - rollo 8 cm - hasta 300 mm/segundo - paralelo, USB</Description>
     <Category>
      <CategoryId>pos</CategoryId>
      <Description>Puntos de Venta</Description>
      <Subcategories>
       <CategoryId>pos.printer</CategoryId>
       <Description>Impresoras para Recibos</Description>
      </Subcategories>
     </Category>
     <InStock>0</InStock>
     <Price>
      <UnitPrice>4865.6042</UnitPrice>
      <CurrencyId>MXN</CurrencyId>
     </Price>
     <Manufacturer>
      <Description>Epson</Description>
      <ManufacturerId>eps</ManufacturerId>
     </Manufacturer>
     <Mpn>C31CA85814</Mpn>
     <Sku>PT910EPS27</Sku>
     <CompilationDate>2020-02-25T12:30:14.6607135Z</CompilationDate>
    </Producto>
  • Could you provide the error message and perhaps a sample of your data? Commented Mar 3, 2020 at 20:18
  • The error message says: org.xml.sax.SAXParseException; lineNumber: 1365; columnNumber: 413; An invalid XML character (Unicode: 0x1) was found in the element content of the document. Commented Mar 3, 2020 at 20:23
  • What if before parsing the file you run through it and remove all invalid characters, would that work? Something like this stackoverflow.com/questions/45009271/… Commented Mar 4, 2020 at 1:36
  • That's exactly what I did, thanks a lot!! Commented Mar 4, 2020 at 2:17

2 Answers


The XML philosophy is that you don't process bad data. If it's not well-formed XML, the parser is supposed to give up, and user applications are supposed to give up. Culturally, this is a reaction against the HTML culture, where it was found that if it's generally expected that data users will tolerate bad data, the consequence is that suppliers will produce bad data.

Standards deliver cost reduction because you can use readily available off-the-shelf tools both for creating valid data and for reading it at the other end. The benefits are totally neutralised if you decide you're going to interchange things that are almost XML but not quite. If you were downloading software you wouldn't put up with it if it didn't compile. So why are you prepared to put up with bad data? Send it back and demand a refund.

Having said that, if the problem is "invalid Unicode characters" then it's possible that it started out as good XML and got corrupted in transit. Find out what went wrong and get it fixed as close to the source of the problem as you can.


2 Comments

I'm actually trying to ignore those records so they won't be processed. I'm talking about 2 characters in a 10 MB file, but those 2 characters leave me with only 100 products in the DB instead of 8,000. Sadly, I have no control over the XML files they send. Of course I could manually edit the XML file every day to fix those characters, but that's not the solution my client needs.
They're not XML files. They're junk.
1

I solved it by removing the invalid characters from the XML file before processing it.

I couldn't do what I was originally trying to do (catch the error and continue), but this workaround worked.
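A minimal sketch of this kind of pre-filtering, not necessarily the exact code I used: a `Reader` wrapper that drops every character outside the XML 1.0 `Char` range before the parser sees it. The class name `XmlCharFilterReader` and its `isAllowed` helper are hypothetical names chosen for illustration. Surrogate code units are passed through unchecked, so an unpaired surrogate would still be rejected by the parser.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

/** Hypothetical helper: silently drops characters that XML 1.0 does not allow. */
public class XmlCharFilterReader extends FilterReader {

    public XmlCharFilterReader(Reader in) {
        super(in);
    }

    /** XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | supplementary planes. */
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || Character.isSurrogate(c); // halves of supplementary characters, passed through as-is
    }

    @Override
    public int read() throws IOException {
        int c;
        // Skip forbidden characters until an allowed one or end of stream is found.
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int kept;
        do {
            int n = super.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the allowed characters to the front of the filled region.
            kept = 0;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[off + kept++] = buf[i];
                }
            }
        } while (kept == 0); // read again if every character in this chunk was dropped
        return kept;
    }
}
```

Assuming the file is UTF-8 (an assumption; use whatever encoding the provider actually sends), the reader can be handed to the parser with `parser.parse(new InputSource(new XmlCharFilterReader(new InputStreamReader(new FileInputStream(xmlFilePath), StandardCharsets.UTF_8))), handler)`. This only removes literal illegal characters such as the 0x1 in my file; an illegal numeric character reference like `&#1;` would still make the parser fail.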
