1

I have to parse user documents that are sometimes not well formed. They might contain spaces before tags or have some other issue. How can I make them well formed, or, if that isn't possible, how can I ignore all the exceptions? I also get exceptions about the byte order mark, because the documents are encoded in UTF-16 but have no byte order mark, and I can't add one because they are user files.

Okay, can anyone tell me what's wrong with this sample data? (This is the note from the device documentation: "All the exchanges generated by this protocol will be carried out by using an XML file conform with the XSD described in this document.")

     <?xml version="1.0" encoding="UTF-16"?>
     <PROTOCOLE_HEMATO_BIOCODE InstrumentCode="2" InstrumentType="Diana 5 Evolution"   SerialNumber="Ns" Version="C4.06" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
     <PROTOCOL_DATA>
     <RESULT>
     <INFORMATION>
     <PATIENT DoB="2011-08-03" FirstName="ALI" Location="" MedicalDoctor="" Name="NAVIDI" PatientCommentary="" PID="" RefTable="1" SID="1059"/>
     </INFORMATION>
     <DATAS DateTimeAnalyse="2011-08-03T11:36:11Z" IdOpAnalyse="Service" UnitsSytem="US">
     <PARAMETER IDParametre="0" LowerRefLimit="4" Nom="WBC" Statut_Limits="48" Units="K/µL" UpperRefLimit="10" Value="4.6"/>
     <PARAMETER IDParametre="1" LowerRefLimit="20" Nom="Lym%" Statut_Limits="48" Units="%" UpperRefLimit="45" Value="37.8"/>
     <PARAMETER IDParametre="2" LowerRefLimit="2" Nom="Mon%" Statut_Limits="48" Units1111="%" UpperRefLimit="8" Value="6"/>
     <PARAMETER IDParametre="3" LowerRefLimit="40"Nom="Neu%" Statut_Limits="48" Units="%" UpperRefLimit="75" Value="51.8"/>
     <PARAMETER IDParametre="4" LowerRefLimit="0" Nom="Bas%" Statut_Limits="48" Units="%" UpperRefLimit="3" Value="0"/>
     <PARAMETER IDParametre="5" LowerRefLimit="1" Nom="Eos%" Statut_Limits="48" Units="%" UpperRefLimit="7" Value="4.4"/>
     <PARAMETER IDParametre="7" LowerRefLimit="1.5" Nom="Lym#" Statut_Limits="48" Units="K/µL" UpperRefLimit="4.5" Value="1.7"/>
     <PARAMETER IDParametre="8" Nom="Mon#" Statut_Limits="48" Units="K/µL" UpperRefLimit="0.8" Value="0.28"/>
     <PARAMETER IDParametre="9" LowerRefLimit="2" Nom="Neu#" Statut_Limits="48" Units="K/µL" UpperRefLimit="7.5" Value="2.4"/>
     <PARAMETER IDParametre="10" Nom="Bas#" Statut_Limits="48" Units="K/µL" UpperRefLimit="0.2" Value="0"/>
     <PARAMETER IDParametre="11" Nom="Eos#" Statut_Limits="48" Units="K/µL" UpperRefLimit="0.6" Value="0.2"/>
     <PARAMETER IDParametre="21" LowerRefLimit="4.5" Nom="RBC" Statut_Limits="48" Units="M/µL" UpperRefLimit="6.2" Value="5.11"/>
     <PARAMETER IDParametre="22" LowerRefLimit="12" Nom="Hb" Statut_Limits="48" Units="g/dL" UpperRefLimit="18" Value="16.2"/>
     <PARAMETER IDParametre="23" LowerRefLimit="35" Nom="Hct" Statut_Limits="48" Units="%" UpperRefLimit="54" Value="48.8"/>
     <PARAMETER IDParametre="24" LowerRefLimit="80" Nom="MCV" Statut_Limits="51" Units="fL" UpperRefLimit="95" Value="95.5"/>
     <PARAMETER IDParametre="25" LowerRefLimit="27" Nom="MCH" Statut_Limits="48" Units="pg" UpperRefLimit="32" Value="31.7"/>
     <PARAMETER IDParametre="26" LowerRefLimit="32" Nom="MCHC" Statut_Limits="48" Units="%" UpperRefLimit="36" Value="33.2"/>
     <PARAMETER IDParametre="27" LowerRefLimit="11" Nom="RDW-cv" Statut_Limits="48" Units="%" UpperRefLimit="15" Value="10.6"/>
     <PARAMETER IDParametre="28" Nom="RDW-sd" Statut_Limits="48" Units="fL" Value="33.9"/>
     <PARAMETER IDParametre="29" LowerRefLimit="150" Nom="Plt" Statut_Limits="48" Units="K/µL" UpperRefLimit="500" Value="200"/>
     <PARAMETER IDParametre="30" LowerRefLimit="6" Nom="MPV" Statut_Limits="48" Units="fL" UpperRefLimit="10" Value="7.3"/>
     <PARAMETER IDParametre="31" Nom="Pct" Statut_Limits="48" Units="%" Value="0.15"/>
     <PARAMETER IDParametre="32" Nom="PDW" Statut_Limits="48" Units="%" Value="8.4"/>
     <PARAMETER IDParametre="33" Nom="Lx" Statut_Limits="48" Units=" " Value="20"/>
     <PARAMETER IDParametre="34" Nom="Ly" Statut_Limits="48" Units=" " Value="16"/>
     <PARAMETER IDParametre="35" Nom="Nx" Statut_Limits="48" Units=" " Value="59"/>
     </DATAS>
     <TRACABILITE IDOpValidation="" ModeleAnalyseur="Diana 5 Evolution" SerialNumber="" VersionCalcul="C4.06" VersionPackage="V6.26">
     <REACTIF ExpirationDate="2014-07-31" Lot="562" Product="HEMATON-5    "/>
     <REACTIF ExpirationDate="2014-05-04" Lot="12452" Product="HEMACORE    "/>
     <REACTIF ExpirationDate="2013-07-03" Lot="73049" Product="HEMALYSE-5    "/>
     <FACTEUR_CALIBRATION FactorDate="2011-07-31" FactorValue="1" IDParametre="0" ParameterName="WBC"/>
     <FACTEUR_CALIBRATION FactorDate="2011-07-31" FactorValue="1" IDParametre="21" ParameterName="RBC"/>
     <FACTEUR_CALIBRATION FactorDate="2011-07-31" FactorValue="1" IDParametre="22" ParameterName="Hb"/>
     <FACTEUR_CALIBRATION FactorDate="2011-07-31" FactorValue="1" IDParametre="24" ParameterName="MCV"/>
     <FACTEUR_CALIBRATION FactorDate="2011-07-31" FactorValue="1" IDParametre="29" ParameterName="Plt"/>
     <FACTEUR_CALIBRATION FactorDate="2011-07-31" FactorValue="1" IDParametre="30" ParameterName="MPV"/>
     </TRACABILITE>
     <IMAGE DataSize="6676" ImageType="3">
     <IMAGE_DATA>AQAAA
     </IMAGE_DATA>
     </IMAGE>
     </RESULT>
     </PROTOCOL_DATA>
     </PROTOCOLE_HEMATO_BIOCODE>

3 Answers

1

You can write (or look on the internet for) an XML sanitizer method, class or library. Basically you need to clean up the XML line by line (removing spaces and such) before you can parse it correctly. Probably what you have now can't even be called XML.
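
As a rough illustration of the sanitizing idea, here is a minimal sketch that handles only one of the problems mentioned in the question: whitespace in front of the XML declaration. The names `XmlSanitizer`, `TrimLeadingWhitespace` and `LoadLenient` are made up for this example, and it assumes that files without a byte order mark are UTF-16 little endian. It does not attempt to repair missing end tags or malformed attributes.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class XmlSanitizer
{
    // Hypothetical helper: removes whitespace that precedes the first '<'.
    // Whitespace before the XML declaration makes the parser reject the document.
    public static string TrimLeadingWhitespace(string rawText)
    {
        return rawText.TrimStart();
    }

    // Hypothetical helper: reads the file, cleans it in memory, then parses it.
    public static XmlDocument LoadLenient(string path)
    {
        // Assumption: a file without a byte order mark is UTF-16 little endian.
        // File.ReadAllText still honors a byte order mark when one is present.
        string rawText = File.ReadAllText(path, Encoding.Unicode);

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(TrimLeadingWhitespace(rawText));
        return doc;
    }
}
```

Because the cleanup happens on an in-memory copy, the user's file is never modified.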

5 Comments

In this case I'd be better off parsing it as plain text. Sanitizing is not an option for me here. The tags are in place and matched, but some tags might have been left unmatched, and I want to ignore those.
But can't you just open the file, read it line by line, and then for every line you come across remove spaces, add missing tags, etc.? I don't see the problem. Maybe add some example XML?
Adding missing tags could be quite difficult, depending on the complexity of the XML. How would you determine whether something is a nested element or whether an end tag is missing?
@hemal: Don't know... that's why he probably needs to add some sample XML. If an end tag is missing you can, however, write some logic that adds it where it needs to be. Didn't say it was easy ;)
Agreed. I was only trying to give an extra data point in favor of your original suggestion to use a sanitizer. The user files can be cleaned up into a temporary file for parsing purposes. Of course, a sanitizer will have the same trouble inserting end tags, unless a schema is available, in which case using a sophisticated third-party tool is preferable to hand-sanitizing.
0

Just to be clear:

  • Just because something looks like XML doesn't mean that it is XML. If your document is not a well formed XML document then it isn't an XML document. From the specification:

A data object is an XML document if it is well-formed

  • If your document is not XML then you can't parse it using an XML parser

If it is just an encoding problem then you can specify the encoding when reading the file:

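// Requires: using System.IO; using System.Text; using System.Xml;
using (StreamReader reader = new StreamReader("myfile.xml", Encoding.Unicode))
{
    // Encoding.Unicode is UTF-16 little endian; it is used when the file has no byte order mark.
    XmlDocument doc = new XmlDocument();
    doc.Load(reader);
}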

The above loads the file "myfile.xml" as UTF-16 using the little-endian byte order.
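
If the encoding is handled and the parser still complains, it helps to know where it gave up. The following is a small sketch, not a definitive implementation: `XmlDiagnostics` and `DescribeFirstError` are hypothetical names, and the helper simply reports the line and position carried by the `XmlException` that `XmlDocument.Load` throws.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlDiagnostics
{
    // Hypothetical helper: returns null when the input parses, otherwise a
    // message containing the position reported by the parser.
    public static string DescribeFirstError(TextReader input)
    {
        try
        {
            XmlDocument doc = new XmlDocument();
            doc.Load(input);
            return null;
        }
        catch (XmlException ex)
        {
            // LineNumber and LinePosition point at the first well-formedness error.
            return string.Format("Line {0}, position {1}: {2}",
                ex.LineNumber, ex.LinePosition, ex.Message);
        }
    }
}
```

It could be called with the same kind of reader as above, for example `XmlDiagnostics.DescribeFirstError(new StreamReader("myfile.xml", Encoding.Unicode))`. Only the first error is reported, because the parser stops there.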

5 Comments

I tend to disagree. If well-formed XML were the same thing as XML, there would be no need to call it well formed. So the term "well formed" implies that there is indeed a form of XML that isn't well formed. Also, people would not say that code written in C# was not written in C# just because it contains a single erroneous line. And if Armin had not used the term XML, none of us would even know what the heck he is talking about.
@Sascha "XML document" is short for "Well formed XML document". The only reason we specify "well formed" is for all the people who seem to think that anything with angle brackets in it is an XML document.
I know what you are getting at, @Justin, and in this respect you are right. The important point, though, is to distinguish between communication between machines and communication between humans. In the context of an application trying to analyze the document, your statement is absolutely true (and a compiler would certainly complain if there is an error in what I believe to be C# code). I read it more along the lines of "Don't call it XML because the W3C Recommendation says otherwise", but apparently you meant it the first way mentioned.
Hmm... Thumbs down for sticking byte orders and charsets in the face of a guy who has perfectly well-formed XML, apart from the fact that there is ONE space missing between parameters on line 12. The fact is, he is probably trying to address issues such as missing DTDs or namespaces.
@mschr Look at the timestamps - the XML sample didn't exist when I wrote this answer, the question was essentially "how can I parse XML which is not well formed"
0

You can try to use SAX for .NET, available at http://saxdotnet.sourceforge.net

It's not a document-parsing API, rather, tag-parsing, so it shouldn't throw exceptions on not-well-formed XML documents. But you'll have to write all the logic to process tags yourself.
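
For comparison, here is a rough sketch of the same streaming idea using the built-in `XmlReader` rather than SAX for .NET, so it is a swapped-in technique and not the library's API. Note that `XmlReader` does still throw an `XmlException` at the first well-formedness error, but everything read before that point has already been delivered to your code. The names `PartialXmlReader` and `ReadElementNamesUntilError` are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class PartialXmlReader
{
    // Hypothetical helper: collects element names until the document ends
    // or the parser hits the first well-formedness error.
    public static List<string> ReadElementNamesUntilError(TextReader input)
    {
        List<string> names = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            try
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element)
                    {
                        names.Add(reader.Name);
                    }
                }
            }
            catch (XmlException)
            {
                // The reader cannot continue past the error; keep what was read so far.
            }
        }
        return names;
    }
}
```

This only salvages the part of the document that precedes the error. Recovering the rest would still need the kind of hand-written tag logic described above.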
