Is there a way to ignore all XML parsing exceptions?

Question

I have to parse user documents that are sometimes not well formed. They might contain spaces before tags or have some other issue. How can I make them well formed or, if this isn't possible, how can I ignore all exceptions? I also get exceptions about the byte order mark, because the document is in UTF-16 encoding but has no byte order mark, and I can't add one because they are user files.

Okay, can anyone tell me what's wrong with this sample data? (This is the note from the device documentation: "All the exchanges generated by this protocol will be carried out by using an XML file conform with the XSD described in this document.")

     <?xml version="1.0" encoding="UTF-16"?>
     <PROTOCOLE_HEMATO_BIOCODE InstrumentCode="2" InstrumentType="Diana 5 Evolution"   SerialNumber="Ns" Version="C4.06" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
     <PROTOCOL_DATA>
     <RESULT>
     <INFORMATION>
     <PATIENT DoB="2011-08-03" FirstName="ALI" Location="" MedicalDoctor="" Name="NAVIDI" PatientCommentary="" PID="" RefTable="1" SID="1059"/>
     </INFORMATION>
     <DATAS DateTimeAnalyse="2011-08-03T11:36:11Z" IdOpAnalyse="Service" UnitsSytem="US">
     <PARAMETER IDParametre="0" LowerRefLimit="4" Nom="WBC" Statut_Limits="48" Units="K/µL" UpperRefLimit="10" Value="4.6"/>
     <PARAMETER IDParametre="1" LowerRefLimit="20" Nom="Lym%" Statut_Limits="48" Units="%" UpperRefLimit="45" Value="37.8"/>
     <PARAMETER IDParametre="2" LowerRefLimit="2" Nom="Mon%" Statut_Limits="48" Units1111="%" UpperRefLimit="8" Value="6"/>
     <PARAMETER IDParametre="3" LowerRefLimit="40"Nom="Neu%" Statut_Limits="48" Units="%" UpperRefLimit="75" Value="51.8"/>
     <PARAMETER IDParametre="4" LowerRefLimit="0" Nom="Bas%" Statut_Limits="48" Units="%" UpperRefLimit="3" Value="0"/>
     <PARAMETER IDParametre="5" LowerRefLimit="1" Nom="Eos%" Statut_Limits="48" Units="%" UpperRefLimit="7" Value="4.4"/>
     <PARAMETER IDParametre="7" LowerRefLimit="1.5" Nom="Lym#" Statut_Limits="48" Units="K/µL" UpperRefLimit="4.5" Value="1.7"/>
     <PARAMETER IDParametre="8" Nom="Mon#" Statut_Limits="48" Units="K/µL" UpperRefLimit="0.8" Value="0.28"/>
     <PARAMETER IDParametre="9" LowerRefLimit="2" Nom="Neu#" Statut_Limits="48" Units="K/µL" UpperRefLimit="7.5" Value="2.4"/>
     <PARAMETER IDParametre="10" Nom="Bas#" Statut_Limits="48" Units="K/µL" UpperRefLimit="0.2" Value="0"/>
     <PARAMETER IDParametre="11" Nom="Eos#" Statut_Limits="48" Units="K/µL" UpperRefLimit="0.6" Value="0.2"/>
     <PARAMETER IDParametre="21" LowerRefLimit="4.5" Nom="RBC" Statut_Limits="48" Units="M/µL" UpperRefLimit="6.2" Value="5.11"/>
     <PARAMETER IDParametre="22" LowerRefLimit="12" Nom="Hb" Statut_Limits="48" Units="g/dL" UpperRefLimit="18" Value="16.2"/>
     <PARAMETER IDParametre="23" LowerRefLimit="35" Nom="Hct" Statut_Limits="48" Units="%" UpperRefLimit="54" Value="48.8"/>
     <PARAMETER IDParametre="24" LowerRefLimit="80" Nom="MCV" Statut_Limits="51" Units="fL" UpperRefLimit="95" Value="95.5"/>
     <PARAMETER IDParametre="25" LowerRefLimit="27" Nom="MCH" Statut_Limits="48" Units="pg" UpperRefLimit="32" Value="31.7"/>
     <PARAMETER IDParametre="26" LowerRefLimit="32" Nom="MCHC" Statut_Limits="48" Units="%" UpperRefLimit="36" Value="33.2"/>
     <PARAMETER IDParametre="27" LowerRefLimit="11" Nom="RDW-cv" Statut_Limits="48" Units="%" UpperRefLimit="15" Value="10.6"/>
     <PARAMETER IDParametre="28" Nom="RDW-sd" Statut_Limits="48" Units="fL" Value="33.9"/>
     <PARAMETER IDParametre="29" LowerRefLimit="150" Nom="Plt" Statut_Limits="48" Units="K/µL" UpperRefLimit="500" Value="200"/>
     <PARAMETER IDParametre="30" LowerRefLimit="6" Nom="MPV" Statut_Limits="48" Units="fL" UpperRefLimit="10" Value="7.3"/>
     <PARAMETER IDParametre="31" Nom="Pct" Statut_Limits="48" Units="%" Value="0.15"/>
     <PARAMETER IDParametre="32" Nom="PDW" Statut_Limits="48" Units="%" Value="8.4"/>
     <PARAMETER IDParametre="33" Nom="Lx" Statut_Limits="48" Units=" " Value="20"/>
     <PARAMETER IDParametre="34" Nom="Ly" Statut_Limits="48" Units=" " Value="16"/>
     <PARAMETER IDParametre="35" Nom="Nx" Statut_Limits="48" Units=" " Value="59"/>
     </DATAS>
     <TRACABILITE IDOpValidation="" ModeleAnalyseur="Diana 5 Evolution" SerialNumber="" VersionCalcul="C4.06" VersionPackage="V6.26">
     <REACTIF ExpirationDate="2014-07-31" Lot="562" Product="HEMATON-5    "/>
     <REACTIF ExpirationDate="2014-05-04" Lot="12452" Product="HEMACORE    "/>
     <REACTIF ExpirationDate="2013-07-03" Lot="73049" Product="HEMALYSE-5    "/>
     <FACTEUR_CALIBRATION FactorDate="2011-07-31" FactorValue="1" IDParametre="0" ParameterName="WBC"/>
     <FACTEUR_CALIBRATION FactorDate="2011-07-31" FactorValue="1" IDParametre="21" ParameterName="RBC"/>
     <FACTEUR_CALIBRATION FactorDate="2011-07-31" FactorValue="1" IDParametre="22" ParameterName="Hb"/>
     <FACTEUR_CALIBRATION FactorDate="2011-07-31" FactorValue="1" IDParametre="24" ParameterName="MCV"/>
     <FACTEUR_CALIBRATION FactorDate="2011-07-31" FactorValue="1" IDParametre="29" ParameterName="Plt"/>
     <FACTEUR_CALIBRATION FactorDate="2011-07-31" FactorValue="1" IDParametre="30" ParameterName="MPV"/>
     </TRACABILITE>
     <IMAGE DataSize="6676" ImageType="3">
     <IMAGE_DATA>AQAAA
     </IMAGE_DATA>
     </IMAGE>
     </RESULT>
     </PROTOCOL_DATA>
     </PROTOCOLE_HEMATO_BIOCODE>

Bas Slagter · Accepted Answer · 2011-09-14 08:00:52Z

1

You can write (or look on the internet for) an XML sanitizer method, class or library. Basically, you need to clean up the XML line by line (removing stray spaces and such) before you can parse it correctly. What you have now probably can't even be called XML.
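A minimal sketch of such a sanitizer in C#, assuming the only defects are whitespace before the XML declaration and a missing space between two attributes (as in the `LowerRefLimit="40"Nom="Neu%"` line of the sample). The class name `XmlSanitizer` and the regular expression are illustrative and not part of any library. The pattern can misfire on text that merely looks like an attribute, and it cannot restore missing end tags.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlSanitizer
{
    // Matches a closing quote that is immediately followed by the next
    // attribute (name="...), i.e. the whitespace between attributes is missing.
    private static readonly Regex MissingSpace =
        new Regex("\"(?=[A-Za-z_:][\\w.:\\-]*=\")");

    public static string Sanitize(string raw)
    {
        // Nothing may precede the XML declaration, so drop leading whitespace
        // (and a stray byte order mark character, if one survived decoding).
        string text = raw.TrimStart('\uFEFF', ' ', '\t', '\r', '\n');

        // Re-insert the space that is missing between two attributes.
        return MissingSpace.Replace(text, "\" ");
    }
}
```

The cleaned string can then be passed to `XmlDocument.LoadXml`; the user's file itself stays untouched because only the in-memory copy is modified.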

answered Sep 14, 2011 at 8:00

Bas Slagter


5 Comments

armin Over a year ago

In this case I'd be better off parsing it as simple text. Sanitizing is not an option for me here. The tags are in place and matched, but some tags might have been left unmatched. I want to ignore those.

Bas Slagter Over a year ago

But can't you just open the file, read it line by line, and then for every line you come across remove spaces, add missing tags, etc.? I don't see the problem. Maybe add some example XML?

Miserable Variable Over a year ago

Adding missing tags could be quite difficult, depending on the complexity of the XML. How would you determine whether something is a nested element or an end tag is missing?

Bas Slagter Over a year ago

@hemal: Don't know... that's why he probably needs to add some sample XML. If an end tag is missing, you can however write some logic that adds it where it needs to be. Didn't say it was easy ;)

Miserable Variable Over a year ago

Agreed. I was only trying to give an extra data point in favor of your original suggestion to use a sanitizer. The user files can be cleaned up into a temporary file for parsing purposes. Of course, a sanitizer will also have the same trouble inserting end tags. Unless a schema is available, in which case using a sophisticated third-party tool is preferable to hand-sanitizing.

Justin · Accepted Answer · 2011-09-14 09:48:53Z

0

Just to be clear:

Just because something looks like XML doesn't mean that it is XML. If your document is not a well formed XML document then it isn't an XML document. From the specification:

A data object is an XML document if it is well-formed

If your document is not XML, then you can't parse it using an XML parser.

If it is just an encoding problem then you can specify the encoding when reading the file:

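    using System.IO;
    using System.Text;
    using System.Xml;

    // Encoding.Unicode decodes the file as UTF-16 (little endian).
    using (StreamReader reader = new StreamReader("myfile.xml", Encoding.Unicode))
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(reader);
    }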

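The above loads the file "myfile.xml" as UTF-16 with little-endian byte order (which is what Encoding.Unicode means in .NET), even when the file has no byte order mark. For big-endian files, use Encoding.BigEndianUnicode instead.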
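If the cause of a failure is unknown, it can help to let the parser report where it gave up before attempting any repair. A small sketch, assuming the text has already been decoded into a string; `XmlCheck.FirstError` is a hypothetical helper name, while `XmlException.LineNumber` and `XmlException.LinePosition` are standard .NET properties.

```csharp
using System;
using System.Xml;

public static class XmlCheck
{
    // Returns an empty string when the text is well formed, otherwise a
    // message containing the position reported by the parser.
    public static string FirstError(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return "";
        }
        catch (XmlException ex)
        {
            // The parser stops at the first well-formedness error it meets.
            return $"Line {ex.LineNumber}, position {ex.LinePosition}: {ex.Message}";
        }
    }
}
```

Only the first problem is reported, so a document with several defects has to be checked again after each fix.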

edited Sep 14, 2011 at 9:48

answered Sep 14, 2011 at 9:01

Justin


5 Comments

Sascha Hennig Over a year ago

I tend to disagree. If well-formed XML equaled XML, then there would be no need to call it well formed. So the term "well formed" implies that there is indeed a form of XML that isn't well formed. Also, people would not say that code written in C# was not written in C# if it contained a single erroneous line. Also, if Armin had not used the term XML, we all wouldn't even know what the heck he is talking about.

Justin Over a year ago

@Sascha "XML document" is short for "Well formed XML document". The only reason we specify "well formed" is for all the people who seem to think that anything with angle brackets in it is an XML document.

Sascha Hennig Over a year ago

I know what you are getting at @Justin, and in this respect you are right. The important point though is to distinguish between communication between machines and communication between humans. In the context of an application trying to analyze the document, your statement is absolutely true (and a compiler would certainly complain if there is an error in what I believe to be C# code). I read it more along the lines of "Don't call it XML because the W3C Recommendation says otherwise." - but apparently you meant it the first way mentioned.

mschr Over a year ago

Hmm.. Thumbs down for having to stick byte orders and char sets in the face of a guy who has perfectly well-formed XML, apart from the fact that there is ONE space missing between attributes on line 12. The fact is, he is probably trying to address issues such as missing DTDs or namespaces.

Justin Over a year ago

@mschr Look at the timestamps - the XML sample didn't exist when I wrote this answer, the question was essentially "how can I parse XML which is not well formed"

Kaerber · Accepted Answer · 2011-09-14 08:49:54Z

0

You can try to use SAX for .NET, available at http://saxdotnet.sourceforge.net

It's not a document-parsing API, rather, tag-parsing, so it shouldn't throw exceptions on not-well-formed XML documents. But you'll have to write all the logic to process tags yourself.

answered Sep 14, 2011 at 8:49

Kaerber
