Delphi xml parser, why are some characters illegal - but many are not?

Question

I've got a pretty basic XML file, for which I made an interface through the automatic generator in Delphi 7. This was working fine until I ran into some odd characters being sent my way. As an example:

<AfasGetConnector>
  <Medewerker>
    <Afstortnummer>0032123</Afstortnummer>
    <Naam>Wiaëröóíïúáäâtè</Naam>
  </Medewerker>
</AfasGetConnector>

Pulling this into Firefox / IE will quickly tell you that there's illegal characters in it. To be exact: ë, é and ö will not be accepted. The rest, however, are perfectly fine (even the capital versions Ë, É and Ö).

This confuses me. Why would those 3 be illegal, but "ä" and most others be fine? Are there any others I should worry about?

The whole block is given to me in a CDATA section, so the initial transfer goes fine. After that, however, I need to pick through the individual "Medewerker" elements from the XML, which are not encapsulated in the CDATA. Hence the issue.

Use Unicode strings and the issues with accented chars go away: stackoverflow.com/questions/2281223/… Or, even better, port your project to a Unicode-aware Delphi.
– Johan, Commented May 4, 2016 at 13:15
D7 is quite capable of dealing with Unicode/UTF-8 for this specific task without the need for any Unicode extensions/libraries. You did not provide any code. As long as you use WideString to hold your strings and use an XML parser which supports Unicode (e.g. MSXML), there is no problem.
– kobik, Commented May 4, 2016 at 14:11
Also, "Pulling this into Firefox / IE will quickly tell you that there's illegal characters in it." How did you "pull" it? Did you save the XML file in Unicode/UTF-8 format? Does the XML have an encoding header?
– kobik, Commented May 4, 2016 at 14:14
I doubt D7 XML Data Binding supports Unicode (you did not mention that in your Q). You need to parse the XML yourself with IXMLDocument.
– kobik, Commented May 4, 2016 at 14:22
@kobik: Delphi's XML Data Binding is built on top of the IXMLDocument/IXMLNode interfaces, which support Unicode via a DOMString data type (an alias for WideString/UnicodeString), and always have.
– Remy Lebeau, Commented May 4, 2016 at 17:32

Remy Lebeau · Accepted Answer · 2016-05-04 17:42:17Z


Pulling this into Firefox / IE will quickly tell you that there's illegal characters in it.

Works fine for me. Neither Firefox nor IE complain about the characters at all.

This confuses me. Why would those 3 be illegal, but "ä" and most others be fine?

They are not illegal at all. The XML specification allows most Unicode codepoints to be used (minus non-printable control characters, UTF-16 surrogates, and reserved codepoints). All of the characters you have shown are legal.
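
As a rough check of that claim, the sketch below tests each character of the sample name against the `Char` production from the XML 1.0 specification. Python is used purely for illustration (it is not part of the original question), and `is_xml_char` is a hypothetical helper name.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


name = "Wiaëröóíïúáäâtè"

# Collect any characters from the sample that XML 1.0 would reject.
illegal = [ch for ch in name if not is_xml_char(ord(ch))]
print(illegal)  # []
```

None of the accented letters fall outside the permitted ranges, so whatever the browsers reject has to come from how the bytes are encoded, not from the characters themselves.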

The whole block is given to me in a CDATA section, so the initial transfer goes fine. After that, however, I need to pick through the individual "Medewerker" elements from the XML, which are not encapsulated in the CDATA. Hence the issue.

You are likely encountering an encoding mismatch between what the XML parser thinks the XML is encoded as, and what the XML is actually encoded as. But since you have not provided the original raw bytes of the XML that was transferred, or the code that is trying to load and parse it, there is no way to know for sure what is actually happening.


4 Comments

T.S Over a year ago

Seems you're right. Chrome doesn't mind the .xml if it's set to UTF-8 in Notepad++, but cries wolf when it's set to ANSI. I can't access the header right now, but will have to check it on Monday. I suspect that, lacking a header, Delphi falls back to the OS default rather than UTF-8.

Remy Lebeau Over a year ago

If the XML is encoded in any charset other than UTF-8, the charset must be explicitly declared in the encoding attribute of the XML's prolog (except for UTF-16, where a BOM will suffice). UTF-8 is the default charset per the XML specification if no other charset is specified. In ANSI, characters ë, é and ö are likely being encoded as bytes 0xEB, 0xE9, and 0xF6, respectively. Those are not valid UTF-8 byte sequences, so any conformant XML parser should complain about them if UTF-8 is assumed. The same is true of the other characters shown (ó í ï ú á ä â).
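
The byte values mentioned above can be reproduced with a short sketch. It assumes that "ANSI" means Windows-1252 (the usual Western European code page), and it uses Python only for illustration; `is_valid_utf8` is a hypothetical helper.

```python
text = "ëéö"

# Assumption: "ANSI" here is Windows-1252; other code pages give other bytes.
ansi_bytes = text.encode("cp1252")
utf8_bytes = text.encode("utf-8")

print(ansi_bytes.hex(" "))  # eb e9 f6
print(utf8_bytes.hex(" "))  # c3 ab c3 a9 c3 b6


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(ansi_bytes))  # False
print(is_valid_utf8(utf8_bytes))  # True
```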

T.S Over a year ago

SoapUI tells me the raw XML is UTF-8; however, AFAS does not send an encoding declaration in the XML. If I manually prefix the XML with an encoding declaration of utf-8, I still get the same error. If, however, I declare the encoding as iso-8859-1, then everything works fine, without any need for character replacing.

Remy Lebeau Over a year ago

@Thomas: then SoapUI is wrong and the XML really is encoded as ISO-8859-1, which explains why other systems are complaining about it. Without an explicit encoding declaration, UTF-8 must be assumed, and the bytes I showed are not valid UTF-8.
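
The behaviour described in the last two comments can be reproduced with a small sketch. It assumes the payload bytes really are ISO-8859-1, and it uses Python's standard `xml.etree.ElementTree` parser as a stand-in for whatever parser the Delphi application uses; `try_parse` is a hypothetical helper, and the element names are taken from the sample in the question.

```python
import xml.etree.ElementTree as ET

body = "<Medewerker><Naam>Wiaëröóíïúáäâtè</Naam></Medewerker>"

# Assumption: the sender encodes the document as ISO-8859-1 without declaring it.
raw = body.encode("iso-8859-1")


def try_parse(data: bytes):
    """Return the text of <Naam>, or None if the parser rejects the input."""
    try:
        return ET.fromstring(data).findtext("Naam")
    except ET.ParseError:
        return None


# No declaration: the parser assumes UTF-8 and rejects the bytes.
no_decl = try_parse(raw)

# Declaring utf-8 does not change the bytes, so parsing still fails.
utf8_decl = try_parse(b'<?xml version="1.0" encoding="utf-8"?>' + raw)

# Declaring the encoding the bytes are actually in makes parsing succeed.
latin1_decl = try_parse(b'<?xml version="1.0" encoding="iso-8859-1"?>' + raw)

print(no_decl, utf8_decl, latin1_decl)
```

Only the third variant yields the name, matching the observation that prefixing an iso-8859-1 declaration works while a utf-8 declaration does not.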
