I have a fairly basic XML file, for which I generated an interface with the automatic generator in Delphi 7. This worked fine until I ran into some odd characters being sent my way. As an example:

<AfasGetConnector>
  <Medewerker>
    <Afstortnummer>0032123</Afstortnummer>
    <Naam>Wiaëröóíïúáäâtè</Naam>
  </Medewerker>
</AfasGetConnector>

Pulling this into Firefox / IE will quickly tell you that there are illegal characters in it. To be exact: ë, é and ö will not be accepted. The rest, however, are perfectly fine. (Even the capital versions Ë, É and Ö are fine.)

This confuses me. Why would those 3 be illegal, but "ä" and most others be fine? Are there any others I should worry about?

The whole block is given to me in a CDATA section, so the initial transfer goes fine. After that, however, I need to pick out the individual "Medewerker" elements from the XML, which are not encapsulated in the CDATA. Hence the issue.

5 Comments

  • Use Unicode strings and the issues with accented chars go away. stackoverflow.com/questions/2281223/… Or, even better, port your project to a Unicode-aware Delphi. Commented May 4, 2016 at 13:15
  • D7 is quite capable of dealing with Unicode/UTF-8 for this specific task without the need for any Unicode extensions/libraries. You did not provide any code. As long as you use WideString to hold your strings and use an XML parser which supports Unicode (e.g. MSXML), there is no problem. Commented May 4, 2016 at 14:11
  • Also, "Pulling this into Firefox / IE will quickly tell you that there's illegal characters in it." How did you "pull" it? Did you save the XML file in Unicode/UTF-8 format? Does the XML have an encoding header? Commented May 4, 2016 at 14:14
  • I doubt D7 XML Data Binding supports Unicode (you did not mention that in your question). You need to parse the XML yourself with IXMLDocument. Commented May 4, 2016 at 14:22
  • @kobik: Delphi's XML Data Binding is built on top of the IXMLDocument/IXMLNode interfaces, which support Unicode via a DOMString data type (an alias for WideString/UnicodeString), and always have. Commented May 4, 2016 at 17:32

1 Answer

Pulling this into Firefox / IE will quickly tell you that there are illegal characters in it.

Works fine for me. Neither Firefox nor IE complain about the characters at all.

This confuses me. Why would those 3 be illegal, but "ä" and most others be fine?

They are not illegal at all. The XML specification allows most Unicode codepoints to be used (minus most control characters other than tab, line feed, and carriage return; UTF-16 surrogates; and the noncharacters U+FFFE and U+FFFF). All of the characters you have shown are legal.
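As a quick sanity check, the legality rule can be sketched in Python (used here only for illustration, since it is not Delphi). The helper name `is_xml10_char` is made up for this example; the ranges are the ones given by the `Char` production of XML 1.0.

```python
def is_xml10_char(ch: str) -> bool:
    """Return True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # everything up to the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # skips surrogates, U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )

# The name from the question's sample XML.
name = "Wiaëröóíïúáäâtè"
illegal = [ch for ch in name if not is_xml10_char(ch)]
print(illegal)  # an empty list means every character is legal XML
```

If the list comes back empty, the characters themselves are not the problem, which points towards the encoding instead.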

The whole block is given to me in a CDATA section, so the initial transfer goes fine. After that, however, I need to pick out the individual "Medewerker" elements from the XML, which are not encapsulated in the CDATA. Hence the issue.

You are likely encountering an encoding mismatch between what the XML parser thinks the XML is encoded as, and what the XML is actually encoded as. But since you have not provided the original raw bytes of the XML that was transferred, or the code that is trying to load and parse it, there is no way to know for sure what is actually happening.
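One way to narrow this down is to look at the raw bytes before any parser touches them. The following is a minimal Python sketch, not Delphi code; `first_invalid_utf8` is a hypothetical helper, and the sample bytes are produced locally rather than taken from the real AFAS response.

```python
def first_invalid_utf8(raw: bytes):
    """Return (offset, byte value) of the first byte that breaks UTF-8, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, raw[exc.start]
    return None

# Assumption: simulate the same text arriving as Windows-1252 ("ANSI") and as UTF-8.
sample = "<Naam>Wiaë</Naam>"
as_ansi = sample.encode("cp1252")
as_utf8 = sample.encode("utf-8")

print(first_invalid_utf8(as_ansi))  # offset and value of the offending byte
print(first_invalid_utf8(as_utf8))  # None: the bytes are valid UTF-8
```

Running the same check on the bytes actually received (for example, dumped to a file) would show whether they are really UTF-8.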

4 Comments

Seems you're right. Chrome doesn't mind the .xml if it's set to UTF-8 in Notepad++, but cries wolf when it's set to ANSI. I can't access the header right now, but will check it on Monday. I suspect that, lacking a header, Delphi falls back to the OS default rather than UTF-8.
If the XML is encoded in any charset other than UTF-8, the charset must be explicitly declared in the encoding attribute of the XML's prolog (except for UTF-16, where a BOM will suffice). UTF-8 is the default charset per the XML specification if no other charset is specified. In ANSI, characters ë, é and ö are likely being encoded as bytes 0xEB, 0xE9, and 0xF6, respectively. Those are not valid UTF-8 byte sequences, so any conformant XML parser should complain about them if UTF-8 is assumed. The same is true of the other characters shown (ó í ï ú á ä â).
SoapUI tells me the raw XML is UTF-8; however, AFAS does not send an encoding declaration in the XML. If I manually prefix the XML with a declaration that sets the encoding to utf-8, I still get the same error. If, however, I declare the encoding as iso-8859-1, everything works fine, without any need for character replacing.
@Thomas Then SoapUI is wrong and the XML really is encoded as ISO-8859-1, which explains why other systems are complaining about it. Without an explicit encoding declaration, UTF-8 must be assumed, and the bytes I showed are not valid UTF-8.
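The behaviour described in these comments can be reproduced with a short Python sketch using the standard library's `xml.etree.ElementTree` (an illustration only; the original setup is Delphi 7). It assumes the payload is ISO-8859-1 bytes with no declaration, as the thread concludes; the variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

# The three characters from the question as single-byte "ANSI" values.
print("ëéö".encode("cp1252").hex())  # ebe9f6

# Assumption: the payload arrives as ISO-8859-1 bytes without an XML declaration.
raw = "<Naam>Wiaëröóíïúáäâtè</Naam>".encode("iso-8859-1")

# Without a declaration the parser assumes UTF-8 and rejects the bytes.
try:
    ET.fromstring(raw)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

# Prefixing the declaration that matches the real encoding makes it parse.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + raw
root = ET.fromstring(declared)
print(parsed_without_declaration, root.text)
```

Prefixing a declaration is a workaround on the receiving side; the cleaner fix would be for the sender to emit the declaration, or to send real UTF-8.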
