0

I'm trying to parse malformed XML with an XmlReader over a StringReader, like this one.

<Page CODE=""L"" page Caption=""Example""><Cell CellType="0"...></Cell></Page>

With the following code, I try to read the value of the CellType attribute on the Cell tag.

        ' l.Label holds the XML-like string read from the database.
        Using reader As XmlReader = XmlReader.Create(New StringReader(l.Label), New XmlReaderSettings With {
                                                     .ValidationType = ValidationType.None,
                                                     .XmlResolver = Nothing})
            ' Visit each Cell element and branch on its CellType attribute.
            While reader.ReadToFollowing("Cell")
                reader.MoveToAttribute("CellType")
                Select Case Int32.Parse(reader.Value)
                    ...
                End Select
            End While
        End Using

This throws the following XmlException:

'Caption' is an unexpected token. The expected token is '='

Is there any way to avoid this exception? Or should I parse the XML beforehand to fix the badly written attribute?

Thanks

2 Answers

3

Should I parse the XML beforehand to fix the badly written attribute?

It's not XML. It's something which looks a bit like XML, but isn't really. Don't try to read non-XML with XML APIs. It will - and should - fail.

Ideally, fix whatever is producing the pseudo-XML to start with.

2 Comments

Yep, it's not. I read it from a database table, and I don't have access to the code that produces it, so I thought the alternative would be to parse it with regular expressions.
@HumbertoBarrientosGonzalez: I wouldn't skip straight from XML to Regex. You may well want to write a custom parser, which then converts it to XML on the fly. You'll need to try to find documentation for the format though.
0

The universal rule of parsers is that they assume the input is valid according to whatever spec the parser was written for. An XML parser, then, assumes you're passing it well-formed XML.

In this case, you're not, because XML doesn't allow attribute names to contain spaces. "page Caption" is not a valid attribute name, so the parser is probably interpreting "page" as the attribute name, treating the space as a delimiter, and then finding "Caption" where it expected an "=".

You can't just "fix" the exception though. The parser is thoroughly confused, and it's giving up. Even if you could somehow force it to continue, there would be no way to guarantee the validity of the results. It's just like if someone went through a book and removed all of the punctuation. You'd probably put it down in frustration because you couldn't understand it. But if someone forced you to read it anyway, you'd probably end up getting the wrong meaning more often than not. The only way to fix the problem is to give the parser input that it understands.

So, yes, you'll need to ensure that the XML is valid before running it through a parser. Where are you obtaining this XML from? Can you fix the generation process so that it uses valid identifiers and conforms properly to an XML schema?
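One way to check the input before handing it to the real processing code is to read it to the end once and catch the XmlException. The following is only a minimal sketch: the helper name IsWellFormedXml, the module name, and the sample string (a simplified version of the string in the question, with the elided part left out) are assumptions made for illustration, not part of the original code.

```vbnet
Imports System
Imports System.IO
Imports System.Xml

Module WellFormedCheck
    ' Hypothetical helper: returns True if the text can be read to the end as XML.
    ' On failure, "problem" receives the parser's error message.
    Function IsWellFormedXml(text As String, ByRef problem As String) As Boolean
        Dim settings As New XmlReaderSettings With {
            .ValidationType = ValidationType.None,
            .XmlResolver = Nothing}
        Try
            Using reader As XmlReader = XmlReader.Create(New StringReader(text), settings)
                While reader.Read()
                    ' Reading every node forces the parser to check the whole document.
                End While
            End Using
            problem = Nothing
            Return True
        Catch ex As XmlException
            problem = ex.Message
            Return False
        End Try
    End Function

    Sub Main()
        ' Assumed sample: the question's string with the elided part removed.
        Dim sample As String = "<Page CODE=""L"" page Caption=""Example""><Cell CellType=""0""></Cell></Page>"
        Dim problem As String = Nothing
        If Not IsWellFormedXml(sample, problem) Then
            Console.WriteLine("Not well-formed: " & problem)
        End If
    End Sub
End Module
```

This only detects the problem; it does not repair anything. Rows that fail the check would still need to go through whatever conversion step turns the stored format into real XML.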

1 Comment

I'm reading it from a database table, so I can't fix the generation process. I'm writing code to convert it to XML, but I don't know if that's the best approach.
